Complementary DNAs

ABSTRACT

The sequences of extended cDNAs encoding secreted proteins are disclosed. The extended cDNAs can be used to express secreted proteins or portions thereof or to obtain antibodies capable of specifically binding to the secreted proteins. The extended cDNAs may also be used in diagnostic, forensic, gene therapy, and chromosome mapping procedures. The extended cDNAs may also be used to design expression vectors and secretion vectors.

RELATED APPLICATIONS

The present application is a divisional of U.S. patent application Ser. No. 10/930,331, filed Aug. 30, 2004, which is a divisional application of U.S. patent application Ser. No. 09/903,190, filed Jul. 11, 2001, now U.S. Pat. No. 6,936,692, which is a divisional application of U.S. patent application Ser. No. 09/247,155, filed Feb. 9, 1999, now U.S. Pat. No. 6,312,922, which claims the benefit of U.S. Provisional Patent application Ser. No. 60/074,121, filed Feb. 9, 1998, U.S. Provisional Patent application Ser. No. 60/081,563, filed Apr. 13, 1998, U.S. Provisional Patent application Ser. No. 60/096,116, filed Aug. 10, 1998, and U.S. Provisional Patent application Ser. No. 60/099,273, filed Sep. 4, 1998, the disclosures of which are incorporated herein by reference in their entirety.

Table I lists the SEQ ID Nos. of the extended cDNAs in the present application, the SEQ ID Nos. of the extended cDNAs in the provisional applications, and the identities of the provisional applications in which the extended cDNAs were disclosed.

The Sequence Listing for this application is on duplicate compact discs labeled “Copy 1” and “Copy 2.” Copy 1 and Copy 2 each contain only one file named “SEQ LIST-36US3.txt” which was created on Dec. 10, 2002, and is 330 KB. The entire contents of each of the computer discs are incorporated herein by reference in their entireties.

BACKGROUND OF THE INVENTION

The estimated 50,000 genes scattered along the human chromosomes offer tremendous promise for the understanding, diagnosis, and treatment of human diseases. In addition, probes capable of specifically hybridizing to loci distributed throughout the human genome find applications in the construction of high resolution chromosome maps and in the identification of individuals.

In the past, the characterization of even a single human gene was a painstaking process, requiring years of effort. Recent developments in the areas of cloning vectors, DNA sequencing, and computer technology have merged to greatly accelerate the rate at which human genes can be isolated, sequenced, mapped, and characterized. Cloning vectors such as yeast artificial chromosomes (YACs) and bacterial artificial chromosomes (BACs) are able to accept DNA inserts ranging from 300 to 1000 kilobases (kb) or 100-400 kb in length, respectively, thereby facilitating the manipulation and ordering of DNA sequences distributed over great distances on the human chromosomes. Automated DNA sequencing machines permit the rapid sequencing of human genes. Bioinformatics software enables the comparison of nucleic acid and protein sequences, thereby assisting in the characterization of human gene products.

Currently, two different approaches are being pursued for identifying and characterizing the genes distributed along the human genome. In one approach, large fragments of genomic DNA are isolated, cloned, and sequenced. Potential open reading frames in these genomic sequences are identified using bioinformatics software. However, this approach entails sequencing large stretches of human DNA which do not encode proteins in order to find the protein encoding sequences scattered throughout the genome. In addition to requiring extensive sequencing, this approach relies on bioinformatics software which may mischaracterize the genomic sequences obtained. Thus, the software may produce false positives in which non-coding DNA is mischaracterized as coding DNA or false negatives in which coding DNA is mislabeled as non-coding DNA.

An alternative approach takes a more direct route to identifying and characterizing human genes. In this approach, complementary DNAs (cDNAs) are synthesized from isolated messenger RNAs (mRNAs) which encode human proteins. Using this approach, sequencing is only performed on DNA which is derived from protein coding portions of the genome. Often, only short stretches of the cDNAs are sequenced to obtain sequences called expressed sequence tags (ESTs). The ESTs may then be used to isolate or purify extended cDNAs which include sequences adjacent to the EST sequences. The extended cDNAs may contain all of the sequence of the EST which was used to obtain them or only a portion of the sequence of the EST which was used to obtain them. In addition, the extended cDNAs may contain the full coding sequence of the gene from which the EST was derived or, alternatively, the extended cDNAs may include portions of the coding sequence of the gene from which the EST was derived. It will be appreciated that there may be several extended cDNAs which include the EST sequence as a result of alternate splicing or the activity of alternative promoters.

In the past, the short EST sequences which could be used to isolate or purify extended cDNAs were often obtained from oligo-dT primed cDNA libraries. Accordingly, they mainly corresponded to the 3′ untranslated region of the mRNA. In part, the prevalence of EST sequences derived from the 3′ end of the mRNA is a result of the fact that typical techniques for obtaining cDNAs are not well suited for isolating cDNA sequences derived from the 5′ ends of mRNAs (Adams et al., Nature 377:174, 1996; Hillier et al., Genome Res. 6:807-828, 1996).

In addition, in those reported instances where longer cDNA sequences have been obtained, the reported sequences typically correspond to coding sequences and do not include the full 5′ untranslated region of the mRNA from which the cDNA is derived. Such incomplete sequences may not include the first exon of the mRNA, particularly in situations where the first exon is short. Furthermore, they may not include some exons, often short ones, which are located upstream of splicing sites. Thus, there is a need to obtain sequences derived from the 5′ ends of mRNAs which can be used to obtain extended cDNAs which may include the 5′ sequences contained in the 5′ ESTs.

While many sequences derived from human chromosomes have practical applications, approaches based on the identification and characterization of those chromosomal sequences which encode a protein product are particularly relevant to diagnostic and therapeutic uses. Of the 50,000 protein coding genes, those genes encoding proteins which are secreted from the cell in which they are synthesized, as well as the secreted proteins themselves, are particularly valuable as potential therapeutic agents. Such proteins are often involved in cell to cell communication and may be responsible for producing a clinically relevant response in their target cells.

In fact, several secretory proteins, including tissue plasminogen activator, G-CSF, GM-CSF, erythropoietin, human growth hormone, insulin, interferon-α, interferon-β, interferon-γ, and interleukin-2, are currently in clinical use. These proteins are used to treat a wide range of conditions, including acute myocardial infarction, acute ischemic stroke, anemia, diabetes, growth hormone deficiency, hepatitis, kidney carcinoma, chemotherapy induced neutropenia and multiple sclerosis. For these reasons, extended cDNAs encoding secreted proteins or portions thereof represent a particularly valuable source of therapeutic agents. Thus, there is a need for the identification and characterization of secreted proteins and the nucleic acids encoding them.

In addition to being therapeutically useful themselves, secretory proteins include short peptides, called signal peptides, at their amino termini which direct their secretion. These signal peptides are encoded by the signal sequences located at the 5′ ends of the coding sequences of genes encoding secreted proteins. Because these signal peptides will direct the extracellular secretion of any protein to which they are operably linked, the signal sequences may be exploited to direct the efficient secretion of any protein by operably linking the signal sequences to a gene encoding the protein for which secretion is desired. This may prove beneficial in gene therapy strategies in which it is desired to deliver a particular gene product to cells other than the cell in which it is produced. Signal sequences encoding signal peptides also find application in simplifying protein purification techniques. In such applications, the extracellular secretion of the desired protein greatly facilitates purification by reducing the number of undesired proteins from which the desired protein must be selected. Thus, there exists a need to identify and characterize the 5′ portions of the genes for secretory proteins which encode signal peptides.

Public information on the number of human genes for which the promoters and upstream regulatory regions have been identified and characterized is quite limited. In part, this may be due to the difficulty of isolating such regulatory sequences. Upstream regulatory sequences such as transcription factor binding sites are typically too short to be utilized as probes for isolating promoters from human genomic libraries. Recently, some approaches have been developed to isolate human promoters. One of them consists of making a CpG island library (Cross, S. H. et al., Purification of CpG Islands using a Methylated DNA Binding Column, Nature Genetics 6: 236-244 (1994)). The second consists of isolating human genomic DNA sequences containing SpeI binding sites by the use of SpeI binding protein (Mortlock et al., Genome Res. 6:327-335, 1996). Both of these approaches have their limits due to a lack of specificity or of comprehensiveness.

5′ ESTs and extended cDNAs obtainable therefrom may be used to efficiently identify and isolate upstream regulatory regions which control the location, developmental stage, rate, and quantity of protein synthesis, as well as the stability of the mRNA (Theil et al., BioFactors 4:87-93 (1993)). Once identified and characterized, these regulatory regions may be utilized in gene therapy or protein purification schemes to obtain the desired amount and locations of protein synthesis or to inhibit, reduce, or prevent the synthesis of undesirable gene products.

In addition, ESTs containing the 5′ ends of secretory protein genes or extended cDNAs which include sequences adjacent to the sequences of the ESTs may include sequences useful as probes for chromosome mapping and the identification of individuals. Thus, there is a need to identify and characterize the sequences upstream of the 5′ coding sequences of genes encoding secretory proteins.

SUMMARY OF THE INVENTION

The present invention relates to purified, isolated, or recombinant extended cDNAs which encode secreted proteins or fragments thereof. Preferably, the purified, isolated or recombinant cDNAs contain the entire open reading frame of their corresponding mRNAs, including a start codon and a stop codon. For example, the extended cDNAs may include nucleic acids encoding the signal peptide as well as the mature protein. Alternatively, the extended cDNAs may contain a fragment of the open reading frame. In some embodiments, the fragment may encode only the sequence of the mature protein. Alternatively, the fragment may encode only a portion of the mature protein. A further aspect of the present invention is a nucleic acid which encodes the signal peptide of a secreted protein.

The present extended cDNAs were obtained using ESTs which include sequences derived from the authentic 5′ ends of their corresponding mRNAs. As used herein the terms “EST” or “5′ EST” refer to the short cDNAs which were used to obtain the extended cDNAs of the present invention. As used herein, the term “extended cDNA” refers to the cDNAs which include sequences adjacent to the 5′ EST used to obtain them. The extended cDNAs may contain all or a portion of the sequence of the EST which was used to obtain them. The term “corresponding mRNA” refers to the mRNA which was the template for the cDNA synthesis which produced the 5′ EST. As used herein, the term “purified” does not require absolute purity; rather, it is intended as a relative definition. Individual extended cDNA clones isolated from a cDNA library have been conventionally purified to electrophoretic homogeneity. The sequences obtained from these clones could not be obtained directly either from the library or from total human DNA. The extended cDNA clones are not naturally occurring as such, but rather are obtained via manipulation of a partially purified naturally occurring substance (messenger RNA). The conversion of mRNA into a cDNA library involves the creation of a synthetic substance (cDNA) and pure individual cDNA clones can be isolated from the synthetic library by clonal selection. Thus, creating a cDNA library from messenger RNA and subsequently isolating individual clones from that library results in an approximately 10⁴-10⁶ fold purification of the native message. Purification of starting material or natural material to at least one order of magnitude, preferably two or three orders, and more preferably four or five orders of magnitude is expressly contemplated.

As used herein, the term “isolated” requires that the material be removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide present in a living animal is not isolated, but the same polynucleotide, separated from some or all of the coexisting materials in the natural system, is isolated.

As used herein, the term “recombinant” means that the extended cDNA is adjacent to “backbone” nucleic acid to which it is not adjacent in its natural environment. Additionally, to be “enriched” the extended cDNAs will represent 5% or more of the number of nucleic acid inserts in a population of nucleic acid backbone molecules. Backbone molecules according to the present invention include nucleic acids such as expression vectors, self-replicating nucleic acids, viruses, integrating nucleic acids, and other vectors or nucleic acids used to maintain or manipulate a nucleic acid insert of interest. Preferably, the enriched extended cDNAs represent 15% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. More preferably, the enriched extended cDNAs represent 50% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. In a highly preferred embodiment, the enriched extended cDNAs represent 90% or more of the number of nucleic acid inserts in the population of recombinant backbone molecules. “Stringent”, “moderate,” and “low” hybridization conditions are as defined in Example 29.

Unless otherwise indicated, a “complementary” sequence is fully complementary. Thus, extended cDNAs encoding secreted polypeptides or fragments thereof which are present in cDNA libraries in which one or more extended cDNAs encoding secreted polypeptides or fragments thereof make up 5% or more of the number of nucleic acid inserts in the backbone molecules are “enriched recombinant extended cDNAs” as defined herein. Likewise, extended cDNAs encoding secreted polypeptides or fragments thereof which are in a population of plasmids in which one or more extended cDNAs of the present invention have been inserted such that they represent 5% or more of the number of inserts in the plasmid backbone are “enriched recombinant extended cDNAs” as defined herein. However, extended cDNAs encoding secreted polypeptides or fragments thereof which are in cDNA libraries in which the extended cDNAs encoding secreted polypeptides or fragments thereof constitute less than 5% of the number of nucleic acid inserts in the population of backbone molecules, such as libraries in which backbone molecules having a cDNA insert encoding a secreted polypeptide are extremely rare, are not “enriched recombinant extended cDNAs.”
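The enrichment thresholds recited above (5%, 15%, 50%, and 90% of the nucleic acid inserts in a population of backbone molecules) can be illustrated with a short sketch. The function name `enrichment_tier` and the tier labels are hypothetical and chosen only for illustration; the sketch assumes that the number of inserts carrying an extended cDNA and the total number of inserts have already been counted.

```python
def enrichment_tier(target_inserts: int, total_inserts: int) -> str:
    """Classify a population of backbone molecules by the share of inserts
    that are extended cDNAs, using the 5/15/50/90 percent thresholds.

    The name and the returned labels are illustrative assumptions.
    """
    if total_inserts <= 0:
        raise ValueError("total_inserts must be positive")
    if not 0 <= target_inserts <= total_inserts:
        raise ValueError("target_inserts must lie between 0 and total_inserts")

    # Integer arithmetic avoids floating-point edge cases at the thresholds.
    tiers = [
        (90, "enriched (highly preferred, 90% or more)"),
        (50, "enriched (more preferred, 50% or more)"),
        (15, "enriched (preferred, 15% or more)"),
        (5, "enriched (5% or more)"),
    ]
    for percent, label in tiers:
        if target_inserts * 100 >= percent * total_inserts:
            return label
    return "not enriched (less than 5%)"
```

For instance, a library in which 4 of 100 inserts are extended cDNAs falls below the 5% threshold and is classified as not enriched, whereas 5 of 100 meets the lowest threshold.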

In particular, the present invention relates to extended cDNAs which were derived from genes encoding secreted proteins. As used herein, a “secreted” protein is one which, when expressed in a suitable host cell, is transported across or through a membrane, including transport as a result of signal peptides in its amino acid sequence. “Secreted” proteins include without limitation proteins secreted wholly (e.g. soluble proteins), or partially (e.g. receptors) from the cell in which they are expressed. “Secreted” proteins also include without limitation proteins which are transported across the membrane of the endoplasmic reticulum.

Extended cDNAs encoding secreted proteins may include nucleic acid sequences, called signal sequences, which encode signal peptides which direct the extracellular secretion of the proteins encoded by the extended cDNAs. Generally, the signal peptides are located at the amino termini of secreted proteins.

Secreted proteins are translated by ribosomes associated with the “rough” endoplasmic reticulum. Generally, secreted proteins are co-translationally transferred to the membrane of the endoplasmic reticulum. Association of the ribosome with the endoplasmic reticulum during translation of secreted proteins is mediated by the signal peptide. The signal peptide is typically cleaved following its co-translational entry into the endoplasmic reticulum. After delivery to the endoplasmic reticulum, secreted proteins may proceed through the Golgi apparatus. In the Golgi apparatus, the proteins may undergo post-translational modification before entering secretory vesicles which transport them across the cell membrane.

The extended cDNAs of the present invention have several important applications. For example, they may be used to express the entire secreted protein which they encode. Alternatively, they may be used to express portions of the secreted protein. The portions may comprise the signal peptides encoded by the extended cDNAs or the mature proteins encoded by the extended cDNAs (i.e. the proteins generated when the signal peptide is cleaved off). The portions may also comprise polypeptides having at least 10 consecutive amino acids encoded by the extended cDNAs. Alternatively, the portions may comprise at least 15 consecutive amino acids encoded by the extended cDNAs. In some embodiments, the portions may comprise at least 25 consecutive amino acids encoded by the extended cDNAs. In other embodiments, the portions may comprise at least 40 amino acids encoded by the extended cDNAs.
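The portions described above (signal peptide, mature protein, and spans of consecutive amino acids) can be sketched as simple string operations on a one-letter amino acid sequence. This is a minimal illustration only: the function names are hypothetical, and the length of the signal peptide is assumed to be known in advance rather than predicted.

```python
from typing import List, Tuple


def split_precursor(protein: str, signal_length: int) -> Tuple[str, str]:
    """Split a precursor into (signal peptide, mature protein).

    signal_length is an assumed, externally supplied cleavage position;
    no cleavage-site prediction is performed here.
    """
    if not 0 < signal_length < len(protein):
        raise ValueError("signal_length must fall inside the protein")
    return protein[:signal_length], protein[signal_length:]


def contiguous_fragments(protein: str, length: int) -> List[str]:
    """List every span of exactly `length` consecutive amino acids."""
    if length <= 0:
        raise ValueError("length must be positive")
    # The range is empty when the protein is shorter than `length`.
    return [protein[i:i + length] for i in range(len(protein) - length + 1)]
```

A span of "at least 10" consecutive amino acids necessarily contains a span of exactly 10, so enumerating fixed-length windows is enough to test the minimum-length conditions.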

Antibodies which specifically recognize the entire secreted proteins encoded by the extended cDNAs or fragments thereof having at least 10 consecutive amino acids, at least 15 consecutive amino acids, at least 25 consecutive amino acids, or at least 40 consecutive amino acids may also be obtained as described below. Antibodies which specifically recognize the mature protein generated when the signal peptide is cleaved may also be obtained as described below. Similarly, antibodies which specifically recognize the signal peptides encoded by the extended cDNAs may also be obtained.

In some embodiments, the extended cDNAs include the signal sequence. In other embodiments, the extended cDNAs may include the full coding sequence for the mature protein (i.e. the protein generated when the signal polypeptide is cleaved off). In addition, the extended cDNAs may include regulatory regions upstream of the translation start site or downstream of the stop codon which control the amount, location, or developmental stage of gene expression. As discussed above, secreted proteins are therapeutically important. Thus, the proteins expressed from the cDNAs may be useful in treating or controlling a variety of human conditions. The extended cDNAs may also be used to obtain the corresponding genomic DNA. The term “corresponding genomic DNA” refers to the genomic DNA which encodes mRNA which includes the sequence of one of the strands of the extended cDNA in which thymidine residues in the sequence of the extended cDNA are replaced by uracil residues in the mRNA.
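The relationship between a cDNA strand and its mRNA stated in the definition of “corresponding genomic DNA” (the same sequence with thymidine replaced by uracil) can be shown in a few lines. The function names below are hypothetical, and the sketch assumes sequences written 5′ to 3′ using only the letters A, C, G, and T.

```python
# Translation table for Watson-Crick complements of the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def strand_to_mrna(strand: str) -> str:
    """Write a cDNA strand as RNA by replacing each T with U."""
    return strand.upper().replace("T", "U")


def candidate_mrnas(cdna_strand: str) -> tuple:
    """Return the RNA form of each of the two strands of a cDNA.

    Only one of the two corresponds to the actual mRNA; which one
    depends on the orientation of the clone, which is not known here.
    """
    return (
        strand_to_mrna(cdna_strand),
        strand_to_mrna(reverse_complement(cdna_strand)),
    )
```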

The extended cDNAs or genomic DNAs obtained therefrom may be used in forensic procedures to identify individuals or in diagnostic procedures to identify individuals having genetic diseases resulting from abnormal expression of the genes corresponding to the extended cDNAs. In addition, the present invention is useful for constructing a high resolution map of the human chromosomes.

The present invention also relates to secretion vectors capable of directing the secretion of a protein of interest. Such vectors may be used in gene therapy strategies in which it is desired to produce a gene product in one cell which is to be delivered to another location in the body. Secretion vectors may also facilitate the purification of desired proteins.

The present invention also relates to expression vectors capable of directing the expression of an inserted gene in a desired spatial or temporal manner or at a desired level. Such vectors may include sequences upstream of the extended cDNAs such as promoters or upstream regulatory sequences.

In addition, the present invention may also be used for gene therapy to control or treat genetic diseases. Signal peptides may also be fused to heterologous proteins to direct their extracellular secretion.

One embodiment of the present invention is a purified or isolated nucleic acid comprising the sequence of one of SEQ ID NOs: 40-84 and 130-154 or a sequence complementary thereto. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a purified or isolated nucleic acid comprising at least 10 consecutive bases of the sequence of one of SEQ ID NOs: 40-84 and 130-154 or one of the sequences complementary thereto. In one aspect of this embodiment, the nucleic acid comprises at least 15, 25, 30, 40, 50, 75, or 100 consecutive bases of one of the sequences of SEQ ID NOs: 40-84 and 130-154 or one of the sequences complementary thereto. The nucleic acid may be a recombinant nucleic acid.
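Whether a nucleic acid comprises at least a given number of consecutive bases of a reference sequence, or of the sequence complementary to it, can be checked with a sliding window. The sketch below is illustrative only: `shares_span` is a hypothetical name, exact matching is assumed (no mismatches or gaps), and the reference would in practice be one of the sequences from the Sequence Listing, which is not reproduced here.

```python
# Translation table for Watson-Crick complements of the four DNA bases.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna: str) -> str:
    """Return the complementary strand, read 5' to 3'."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def shares_span(query: str, reference: str, min_len: int = 10) -> bool:
    """True if `query` contains min_len consecutive bases that also occur
    in `reference` or in the reverse complement of `reference`."""
    if min_len <= 0:
        raise ValueError("min_len must be positive")
    query = query.upper()
    reference = reference.upper()
    targets = (reference, reverse_complement(reference))
    # Slide a window of min_len bases along the query.
    for start in range(len(query) - min_len + 1):
        window = query[start:start + min_len]
        if any(window in target for target in targets):
            return True
    return False
```

Raising `min_len` to 15, 25, 30, 40, 50, 75, or 100 gives the stricter variants recited in the same paragraph.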

Another embodiment of the present invention is a purified or isolated nucleic acid of at least 15 bases capable of hybridizing under stringent conditions to the sequence of one of SEQ ID NOs: 40-84 and 130-154 or a sequence complementary to one of the sequences of SEQ ID NOs: 40-84 and 130-154. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a purified or isolated nucleic acid comprising the full coding sequences of one of SEQ ID NOs: 40-84 and 130-154 wherein the full coding sequence optionally comprises the sequence encoding signal peptide as well as the sequence encoding mature protein. In a preferred embodiment, the isolated or purified nucleic acid comprises the full coding sequence of one of SEQ ID NOs: 40-59, 61-73, 75, 77-82, and 130-154 wherein the full coding sequence comprises the sequence encoding signal peptide and the sequence encoding mature protein. In one aspect of this embodiment, the nucleic acid is recombinant.

A further embodiment of the present invention is a purified or isolated nucleic acid comprising the nucleotides of one of SEQ ID NOs: 40-84 and 130-154 which encode a mature protein. In a preferred embodiment, the purified or isolated nucleic acid comprises the nucleotides of one of SEQ ID NOs: 40-59, 61-75, 77-82, and 130-154 which encode a mature protein. In one aspect of this embodiment, the nucleic acid is recombinant.

Yet another embodiment of the present invention is a purified or isolated nucleic acid comprising the nucleotides of one of SEQ ID NOs: 40-84 and 130-154 which encode the signal peptide. In a preferred embodiment, the purified or isolated nucleic acid comprises the nucleotides of SEQ ID NOs: 40-59, 61-73, 75-82, 84, and 130-154 which encode the signal peptide. In one aspect of this embodiment, the nucleic acid is recombinant.

Another embodiment of the present invention is a purified or isolated nucleic acid encoding a polypeptide having the sequence of one of the sequences of SEQ ID NOs: 85-129 and 155-179.

Another embodiment of the present invention is a purified or isolated nucleic acid encoding a polypeptide having the sequence of a mature protein included in one of the sequences of SEQ ID NOs: 85-129 and 155-179. In a preferred embodiment, the purified or isolated nucleic acid encodes a polypeptide having the sequence of a mature protein included in one of the sequences of SEQ ID NOs: 85-104, 106-120, 122-127, and 155-179.

Another embodiment of the present invention is a purified or isolated nucleic acid encoding a polypeptide having the sequence of a signal peptide included in one of the sequences of SEQ ID NOs: 85-129 and 155-179. In a preferred embodiment, the purified or isolated nucleic acid encodes a polypeptide having the sequence of a signal peptide included in one of the sequences of SEQ ID NOs: 85-104, 106-118, 120-127, 129, and 155-179.

Yet another embodiment of the present invention is a purified or isolated protein comprising the sequence of one of SEQ ID NOs: 85-129 and 155-179.

Another embodiment of the present invention is a purified or isolated polypeptide comprising at least 10 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179. In one aspect of this embodiment, the purified or isolated polypeptide comprises at least 15, 20, 25, 35, 50, 75, 100, 150 or 200 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179. In still another aspect, the purified or isolated polypeptide comprises at least 25 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179.

Another embodiment of the present invention is an isolated or purified polypeptide comprising a signal peptide of one of the polypeptides of SEQ ID NOs: 85-129 and 155-179. In a preferred embodiment, the isolated or purified polypeptide comprises a signal peptide of one of the polypeptides of SEQ ID NOs: 85-104, 106-118, 120-127, 129, and 155-179.

Yet another embodiment of the present invention is an isolated or purified polypeptide comprising a mature protein of one of the polypeptides of SEQ ID NOs: 85-129 and 155-179. In a preferred embodiment, the isolated or purified polypeptide comprises a mature protein of one of the polypeptides of SEQ ID NOs: 85-104, 106-120, 122-127, and 155-179. In a preferred embodiment, the purified or isolated nucleic acid encodes a polypeptide having the sequence of a mature protein included in one of the sequences of SEQ ID NOs: 85-104, 106-120, 122-127, and 155-179.

A further embodiment of the present invention is a method of making a protein comprising one of the sequences of SEQ ID NOs: 85-129 and 155-179, comprising the steps of obtaining a cDNA comprising one of the sequences of SEQ ID NOs: 40-84 and 130-154, inserting the cDNA in an expression vector such that the cDNA is operably linked to a promoter, and introducing the expression vector into a host cell whereby the host cell produces the protein encoded by said cDNA. In one aspect of this embodiment, the method further comprises the step of isolating the protein.

Another embodiment of the present invention is a protein obtainable by the method described in the preceding paragraph.

Another embodiment of the present invention is a method of making a protein comprising the amino acid sequence of the mature protein contained in one of the sequences of SEQ ID NOs: 85-104, 106-120, 122-127, and 155-179 comprising the steps of obtaining a cDNA comprising one of the nucleotide sequences of SEQ ID NOs: 40-59, 61-75, 77-82, and 130-154 which encode for the mature protein, inserting the cDNA in an expression vector such that the cDNA is operably linked to a promoter, and introducing the expression vector into a host cell whereby the host cell produces the mature protein encoded by the cDNA. In one aspect of this embodiment, the method further comprises the step of isolating the protein.

Another embodiment of the present invention is a mature protein obtainable by the method described in the preceding paragraph.

In a preferred embodiment, the above method comprises a method of making a protein comprising the amino acid sequence of the mature protein contained in one of the sequences of SEQ ID NOs: 85-104, 106-120, 122-127 and 155-179, comprising the steps of obtaining a cDNA comprising one of the nucleotide sequences of SEQ ID NOs: 40-59, 61-75, 77-82 and 130-154 which encode for the mature protein, inserting the cDNA in an expression vector such that the cDNA is operably linked to a promoter, and introducing the expression vector into a host cell whereby the host cell produces the mature protein encoded by the cDNA. In one aspect of this embodiment, the method further comprises the step of isolating the protein.

Another embodiment of the present invention is a host cell containing the purified or isolated nucleic acids comprising the sequence of one of SEQ ID NOs: 40-84 and 130-154 or a sequence complementary thereto described herein.

Another embodiment of the present invention is a host cell containing the purified or isolated nucleic acids comprising the full coding sequences of one of SEQ ID NOs: 40-59, 61-73, 75, 77-82, and 130-154, wherein the full coding sequence comprises the sequence encoding signal peptide and the sequence encoding mature protein described herein.

Another embodiment of the present invention is a host cell containing the purified or isolated nucleic acids comprising the nucleotides of one of SEQ ID NOs: 40-84 and 130-154 which encode a mature protein which are described herein. Preferably, the host cell contains the purified or isolated nucleic acids comprising the nucleotides of one of SEQ ID NOs: 40-59, 61-75, 77-82, and 130-154 which encode a mature protein.

Another embodiment of the present invention is a host cell containing the purified or isolated nucleic acids comprising the nucleotides of one of SEQ ID NOs: 40-84 and 130-154 which encode the signal peptide which are described herein. Preferably, the host cell contains the purified or isolated nucleic acids comprising the nucleotides of one of SEQ ID NOs: 40-59, 61-73, 75-82, 84, and 130-154 which encode the signal peptide.

Another embodiment of the present invention is a purified or isolated antibody capable of specifically binding to a protein having the sequence of one of SEQ ID NOs: 85-129 and 155-179. In one aspect of this embodiment, the antibody is capable of binding to a polypeptide comprising at least 10 consecutive amino acids of the sequence of one of SEQ ID NOs: 85-129 and 155-179.

Another embodiment of the present invention is an array of cDNAs or fragments thereof of at least 15 nucleotides in length which includes at least one of the sequences of SEQ ID NOs: 40-84 and 130-154, or one of the sequences complementary to the sequences of SEQ ID NOs: 40-84 and 130-154, or a fragment thereof of at least 15 consecutive nucleotides. In one aspect of this embodiment, the array includes at least two of the sequences of SEQ ID NOs: 40-84 and 130-154, the sequences complementary to the sequences of SEQ ID NOs: 40-84 and 130-154, or fragments thereof of at least 15 consecutive nucleotides. In another aspect of this embodiment, the array includes at least five of the sequences of SEQ ID NOs: 40-84 and 130-154, the sequences complementary to the sequences of SEQ ID NOs: 40-84 and 130-154, or fragments thereof of at least 15 consecutive nucleotides.

A further embodiment of the invention encompasses purifiedpolynucleotides comprising an insert from a clone deposited in a deposithaving an accession number selected from the group consisting of theaccession numbers listed in Table VI or a fragment thereof comprising acontiguous span of at least 8, 10, 12, 15, 20, 25, 40, 60, 100, or 200nucleotides of said insert. An additional embodiment of the inventionencompasses purified polypeptides which comprise, consist of, or consistessentially of an amino acid sequence encoded by the insert from a clonedeposited in a deposit having an accession number selected from thegroup consisting of the accession numbers listed in Table VI, as well aspolypeptides which comprise a fragment of said amino acid sequenceconsisting of a signal peptide, a mature protein, or a contiguous spanof at least 5, 8, 10, 12, 15, 20, 25, 40, 60, 100, or 200 amino acidsencoded by said insert.

An additional embodiment of the invention encompasses purified polypeptides which comprise a contiguous span of at least 5, 8, 10, 12, 15, 20, 25, 40, 60, 100, or 200 amino acids of SEQ ID NOs: 85-129 and 155-179, wherein said contiguous span comprises at least one of the amino acid positions which was not shown to be identical to a public sequence in any of FIGS. 10 to 12. Also encompassed by the invention are purified polynucleotides encoding said polypeptides.

Another embodiment of the present invention is a computer readable medium having stored thereon a sequence selected from the group consisting of a cDNA code of SEQ ID NOs. 40-84 and 130-154 and a polypeptide code of SEQ ID NOs. 85-129 and 155-179.

Another embodiment of the present invention is a computer system comprising a processor and a data storage device wherein the data storage device has stored thereon a sequence selected from the group consisting of a cDNA code of SEQ ID NOs. 40-84 and 130-154 and a polypeptide code of SEQ ID NOs. 85-129 and 155-179. In some embodiments the computer system further comprises a sequence comparer and a data storage device having reference sequences stored thereon. For example, the sequence comparer may comprise a computer program which indicates polymorphisms. In other aspects of the computer system, the system further comprises an identifier which identifies features in said sequence.

Another embodiment of the present invention is a method for comparing a first sequence to a reference sequence wherein the first sequence is selected from the group consisting of a cDNA code of SEQ ID NOs. 40-84 and 130-154 and a polypeptide code of SEQ ID NOs. 85-129 and 155-179, comprising the steps of reading the first sequence and the reference sequence through use of a computer program which compares sequences and determining differences between the first sequence and the reference sequence with the computer program. In some embodiments of the method, the step of determining differences between the first sequence and the reference sequence comprises identifying polymorphisms.
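As a rough illustration of the comparison step described above, the following Python sketch reports the positions at which a first sequence differs from a reference sequence. It is a minimal sketch and not the program contemplated by the application: the function name find_differences, the example sequences, and the restriction to pre-aligned sequences of equal length are all assumptions made for illustration.

```python
def find_differences(first, reference):
    """Return (1-based position, reference residue, observed residue) for each mismatch.

    Assumes the two sequences are already aligned and of equal length;
    a real comparer would first perform an alignment.
    """
    if len(first) != len(reference):
        raise ValueError("this sketch only compares pre-aligned sequences of equal length")
    return [
        (index + 1, ref, obs)
        for index, (obs, ref) in enumerate(zip(first.upper(), reference.upper()))
        if obs != ref
    ]


if __name__ == "__main__":
    # Hypothetical sequences, not taken from the Sequence Listing.
    reference = "ATGGCCAAGT"
    first = "ATGGCTAAGT"
    for position, ref, obs in find_differences(first, reference):
        print(f"position {position}: {ref} -> {obs}")
```

Each reported mismatch is a candidate polymorphism in the sense used above; whether it is a true polymorphism or a sequencing error cannot be decided by the comparison alone.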

Another embodiment of the present invention is a method for identifying a feature in a sequence selected from the group consisting of a cDNA code of SEQ ID NOs. 40-84 and 130-154 and a polypeptide code of SEQ ID NOs. 85-129 and 155-179 comprising the steps of reading the sequence through the use of a computer program which identifies features in sequences and identifying features in the sequence with said computer program.
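One simple kind of feature that such a program might identify in a cDNA code is an open reading frame. The Python sketch below scans the three forward reading frames for ATG...stop spans. It is an illustration under stated assumptions only: the function name find_orfs, the min_codons parameter, and the example sequence are hypothetical, and the application does not specify this algorithm.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(cdna, min_codons=1):
    """Return (start, end) 0-based half-open coordinates of ATG...stop spans.

    Only the three forward frames are scanned. min_codons is the minimum
    number of codons between the ATG (inclusive) and the stop (exclusive).
    """
    seq = cdna.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i  # first ATG in this frame opens a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None  # look for the next ATG after this stop
    return orfs


if __name__ == "__main__":
    # Hypothetical sequence, not taken from the Sequence Listing.
    print(find_orfs("CCATGAAATAGGG"))
```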

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a summary of a procedure for obtaining cDNAs which have been selected to include the 5′ ends of the mRNAs from which they are derived.

FIG. 2 is an analysis of the 43 amino terminal amino acids of all human SwissProt proteins to determine the frequency of false positives and false negatives using the techniques for signal peptide identification described herein.

FIG. 3 shows the distribution of von Heijne scores for 5′ ESTs in each of the categories described herein and the probability that these 5′ ESTs encode a signal peptide.

FIG. 4 shows the distribution of 5′ ESTs in each category and the number of 5′ ESTs in each category having a given minimum von Heijne's score.

FIG. 5 shows the tissues from which the mRNAs corresponding to the 5′ ESTs in each of the categories described herein were obtained.

FIG. 6 illustrates a method for obtaining extended cDNAs.

FIG. 7 is a map of pED6dpc2.

FIG. 8 provides a schematic description of the promoters isolated and the way they are assembled with the corresponding 5′ tags.

FIG. 9 describes the transcription factor binding sites present in each of these promoters.

FIG. 10 is an alignment of the proteins of SEQ ID NOs: 120 and 180 wherein the signal peptide is in italics, the predicted transmembrane segment is underlined, the experimentally determined transmembrane segment is double underlined, and the ATP1G/PLMN/MAT8 signature is in bold.

FIG. 11 is an alignment of the proteins of SEQ ID NOs: 121 and 181 wherein the predicted transmembrane segment is underlined.

FIG. 12 is an alignment of the proteins of SEQ ID NOs: 128 and 182 wherein the PPPY motif is in bold.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

I. Obtaining 5′ ESTs

The present extended cDNAs were obtained using 5′ ESTs which were isolated as described below.

A. Chemical Methods for Obtaining mRNAs having Intact 5′ Ends

In order to obtain the 5′ ESTs used to obtain the extended cDNAs of the present invention, mRNAs having intact 5′ ends must be obtained. Currently, there are two approaches for obtaining such mRNAs. One of these approaches is a chemical modification method involving derivatization of the 5′ ends of the mRNAs and selection of the derivatized mRNAs. The 5′ ends of eukaryotic mRNAs possess a structure referred to as a “cap” which comprises a guanosine methylated at the 7 position. The cap is joined to the first transcribed base of the mRNA by a 5′,5′-triphosphate bond. In some instances, the 5′ guanosine is methylated in both the 2 and 7 positions. Rarely, the 5′ guanosine is trimethylated at the 2, 2 and 7 positions. In the chemical method for obtaining mRNAs having intact 5′ ends, the 5′ cap is specifically derivatized and coupled to a reactive group on an immobilizing substrate. This specific derivatization is based on the fact that only the ribose linked to the methylated guanosine at the 5′ end of the mRNA and the ribose linked to the base at the 3′ terminus of the mRNA possess 2′,3′-cis diols. Optionally, where the 3′ terminal ribose has a 2′,3′-cis diol, the 2′,3′-cis diol at the 3′ end may be chemically modified, substituted, converted, or eliminated, leaving only the ribose linked to the methylated guanosine at the 5′ end of the mRNA with a 2′,3′-cis diol. A variety of techniques are available for eliminating the 2′,3′-cis diol on the 3′ terminal ribose. For example, controlled alkaline hydrolysis may be used to generate mRNA fragments in which the 3′ terminal ribose is a 3′-phosphate, 2′-phosphate or (2′,3′)-cyclophosphate. Thereafter, the fragment which includes the original 3′ ribose may be eliminated from the mixture through chromatography on an oligo-dT column. Alternatively, a base which lacks the 2′,3′-cis diol may be added to the 3′ end of the mRNA using an RNA ligase such as T4 RNA ligase. Example 1 below describes a method for ligation of pCp to the 3′ end of messenger RNA.

EXAMPLE 1 Ligation of the Nucleoside Diphosphate pCp to the 3′ End of Messenger RNA

1 μg of RNA was incubated in a final reaction medium of 10 μl in the presence of 5 U of T₄ phage RNA ligase in the buffer provided by the manufacturer (Gibco-BRL), 40 U of the RNase inhibitor RNASIN (Promega) and 2 μl of ³²pCp (Amersham #PB 10208).

The incubation was performed at 37° C. for 2 hours or overnight at 7-8° C.

Following modification or elimination of the 2′,3′-cis diol at the 3′ ribose, the 2′,3′-cis diol present at the 5′ end of the mRNA may be oxidized using a reagent such as sodium periodate, thereby converting the 2′,3′-cis diol to a dialdehyde. Example 2 describes the oxidation of the 2′,3′-cis diol at the 5′ end of the mRNA with sodium periodate.

EXAMPLE 2 Oxidation of 2′,3′-cis Diol at the 5′ End of the mRNA

0.1 OD unit of either a capped oligoribonucleotide of 47 nucleotides (including the cap) or an uncapped oligoribonucleotide of 46 nucleotides was treated as follows. The oligoribonucleotides were produced by in vitro transcription using the transcription kit AMPLISCRIBE T7 (Epicentre Technologies). As indicated below, the DNA template for the RNA transcript contained a single cytosine. To synthesize the uncapped RNA, all four NTPs were included in the in vitro transcription reaction. To obtain the capped RNA, GTP was replaced by an analogue of the cap, m7G(5′)ppp(5′)G. This compound, recognized by polymerase, was incorporated into the 5′ end of the nascent transcript during the step of initiation of transcription but was not capable of incorporation during the extension step. Consequently, the resulting RNA contained a cap at its 5′ end. The sequences of the oligoribonucleotides produced by the in vitro transcription reaction were:

+Cap (SEQ ID NO:1): 5′-m7GpppGCAUCCUACUCCCAUCCAAUUCCACCCUAACUCCUCCCAUCUCCAC-3′
−Cap (SEQ ID NO:2): 5′-pppGCAUCCUACUCCCAUCCAAUUCCACCCUAACUCCUCCCAUCUCCAC-3′
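The design of these transcripts can be checked with a few lines of Python. The sketch below counts the nucleotides of the transcribed sequence and locates its guanosines; the variable and function names are illustrative assumptions. The transcribed sequence contains a single G, at position 1, which is consistent with a template containing a single cytosine and with the cap analogue being usable only at initiation.

```python
# Transcribed body shared by SEQ ID NO:1 and SEQ ID NO:2
# (5' cap or 5' triphosphate omitted).
TRANSCRIPT = "GCAUCCUACUCCCAUCCAAUUCCACCCUAACUCCUCCCAUCUCCAC"


def summarize_transcript(rna, capped=False):
    """Return the nucleotide count and the 1-based positions of G residues.

    When capped is True, the m7G of the cap is counted as one extra nucleotide.
    """
    return {
        "length": len(rna) + (1 if capped else 0),
        "g_positions": [i + 1 for i, base in enumerate(rna) if base == "G"],
    }


if __name__ == "__main__":
    print("uncapped:", summarize_transcript(TRANSCRIPT))
    print("capped:  ", summarize_transcript(TRANSCRIPT, capped=True))
```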

The oligoribonucleotides were dissolved in 9 μl of acetate buffer (0.1 M sodium acetate, pH 5.2) and 3 μl of freshly prepared 0.1 M sodium periodate solution. The mixture was incubated for 1 hour in the dark at 4° C. or room temperature. Thereafter, the reaction was stopped by adding 4 μl of 10% ethylene glycol. The product was ethanol precipitated, resuspended in 10 μl or more of water or appropriate buffer and dialyzed against water.
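The working concentrations implied by these volumes follow from simple dilution arithmetic. The Python sketch below computes them, assuming that the volumes are additive and that the dissolved RNA contributes no volume; the function name diluted is an illustrative assumption.

```python
def diluted(stock_value, stock_volume_ul, total_volume_ul):
    """Concentration after diluting stock_volume_ul of stock into total_volume_ul."""
    return stock_value * stock_volume_ul / total_volume_ul


# 3 ul of 0.1 M periodate in 9 ul buffer + 3 ul periodate = 12 ul.
periodate_molar = diluted(0.1, 3.0, 9.0 + 3.0)

# 4 ul of 10% ethylene glycol added to the 12 ul reaction = 16 ul.
glycol_percent = diluted(10.0, 4.0, 9.0 + 3.0 + 4.0)

if __name__ == "__main__":
    print(f"periodate during oxidation: {periodate_molar * 1000:.1f} mM")
    print(f"ethylene glycol after quenching: {glycol_percent:.2f} %")
```

Under these assumptions the oxidation proceeds in about 25 mM periodate, and the quench brings ethylene glycol to about 2.5%.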

The resulting aldehyde groups may then be coupled to molecules having a reactive amine group, such as hydrazine, carbazide, thiocarbazide or semicarbazide groups, in order to facilitate enrichment of the 5′ ends of the mRNAs. Molecules having reactive amine groups which are suitable for use in selecting mRNAs having intact 5′ ends include avidin, proteins, antibodies, vitamins, ligands capable of specifically binding to receptor molecules, or oligonucleotides. Example 3 below describes the coupling of the resulting dialdehyde to biotin.

EXAMPLE 3 Coupling of the Dialdehyde with Biotin

The oxidation product obtained in Example 2 was dissolved in 50 μl of sodium acetate at a pH of between 5 and 5.2 and 50 μl of freshly prepared 0.02 M solution of biotin hydrazide in a methoxyethanol/water mixture (1:1) of formula:

In the compound used in these experiments, n=5. However, it will be appreciated that other commercially available hydrazides may also be used, such as molecules of the formula above in which n varies from 0 to 5.

The mixture was then incubated for 2 hours at 37° C. Following the incubation, the mixture was precipitated with ethanol and dialyzed against distilled water.

Example 4 demonstrates the specificity of the biotinylation reaction.

EXAMPLE 4 Specificity of Biotinylation

The specificity of the biotinylation for capped mRNAs was evaluated by gel electrophoresis of the following samples:

Sample 1. The 46 nucleotide uncapped in vitro transcript prepared as in Example 2 and labeled with ³²pCp as described in Example 1.

Sample 2. The 46 nucleotide uncapped in vitro transcript prepared as in Example 2, labeled with ³²pCp as described in Example 1, treated with the oxidation reaction of Example 2, and subjected to the biotinylation conditions of Example 3.

Sample 3. The 47 nucleotide capped in vitro transcript prepared as in Example 2 and labeled with ³²pCp as described in Example 1.

Sample 4. The 47 nucleotide capped in vitro transcript prepared as in Example 2, labeled with ³²pCp as described in Example 1, treated with the oxidation reaction of Example 2, and subjected to the biotinylation conditions of Example 3.

Samples 1 and 2 had identical migration rates, demonstrating that the uncapped RNAs were not oxidized and biotinylated. Sample 3 migrated more slowly than Samples 1 and 2, while Sample 4 exhibited the slowest migration. The difference in migration of the RNAs in Samples 3 and 4 demonstrates that the capped RNAs were specifically biotinylated.

In some cases, mRNAs having intact 5′ ends may be enriched by binding the molecule containing a reactive amine group to a suitable solid phase substrate such as the inside of the vessel containing the mRNAs, magnetic beads, chromatography matrices, or nylon or nitrocellulose membranes. For example, where the molecule having a reactive amine group is biotin, the solid phase substrate may be coupled to avidin or streptavidin. Alternatively, where the molecule having the reactive amine group is an antibody or receptor ligand, the solid phase substrate may be coupled to the cognate antigen or receptor. Finally, where the molecule having a reactive amine group comprises an oligonucleotide, the solid phase substrate may comprise a complementary oligonucleotide.

The mRNAs having intact 5′ ends may be released from the solid phase following the enrichment procedure. For example, where the dialdehyde is coupled to biotin hydrazide and the solid phase comprises streptavidin, the mRNAs may be released from the solid phase by simply heating to 95 degrees Celsius in 2% SDS. In some methods, the molecule having a reactive amine group may also be cleaved from the mRNAs having intact 5′ ends following enrichment. Example 5 describes the capture of biotinylated mRNAs with streptavidin coated beads and the release of the biotinylated mRNAs from the beads following enrichment.

EXAMPLE 5 Capture and Release of Biotinylated mRNAs Using Streptavidin Coated Beads

The streptavidin-coated magnetic beads were prepared according to the manufacturer's instructions (CPG Inc., USA). The biotinylated mRNAs were added to a hybridization buffer (1.5 M NaCl, pH 5-6). After incubating for 30 minutes, the unbound and nonbiotinylated material was removed. The beads were washed several times in water with 1% SDS. The beads obtained were incubated for 15 minutes at 95° C. in water containing 2% SDS.

Example 6 demonstrates the efficiency with which biotinylated mRNAs were recovered from the streptavidin coated beads.

EXAMPLE 6 Efficiency of Recovery of Biotinylated mRNAs

The efficiency of the recovery procedure was evaluated as follows. RNAs were labeled with ³²pCp, oxidized, biotinylated and bound to streptavidin coated beads as described above. Subsequently, the bound RNAs were incubated for 5, 15 or 30 minutes at 95° C. in the presence of 2% SDS.

The products of the reaction were analyzed by electrophoresis on 12% polyacrylamide gels under denaturing conditions (7 M urea). The gels were subjected to autoradiography. During this manipulation, the hydrazone bonds were not reduced.

Increasing amounts of nucleic acids were recovered as incubation times in 2% SDS increased, demonstrating that biotinylated mRNAs were efficiently recovered.

In an alternative method for obtaining mRNAs having intact 5′ ends, an oligonucleotide which has been derivatized to contain a reactive amine group is specifically coupled to mRNAs having an intact cap. Preferably, the 3′ end of the mRNA is blocked prior to the step in which the aldehyde groups are joined to the derivatized oligonucleotide, as described above, so as to prevent the derivatized oligonucleotide from being joined to the 3′ end of the mRNA. For example, pCp may be attached to the 3′ end of the mRNA using T4 RNA ligase. However, as discussed above, blocking the 3′ end of the mRNA is an optional step. Derivatized oligonucleotides may be prepared as described below in Example 7.

EXAMPLE 7 Derivatization of the Oligonucleotide

An oligonucleotide phosphorylated at its 3′ end was converted to a 3′ hydrazide by treatment with an aqueous solution of hydrazine or of dihydrazide of the formula H₂N(R1)NH₂ at about 1 to 3 M, and at pH 4.5, in the presence of a carbodiimide type agent soluble in water such as 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide at a final concentration of 0.3 M at a temperature of 8° C. overnight.

The derivatized oligonucleotide was then separated from the other agents and products using a standard technique for isolating oligonucleotides.

As discussed above, the mRNAs to be enriched may be treated to eliminate the 3′ OH groups which may be present thereon. This may be accomplished by enzymatic ligation of sequences lacking a 3′ OH, such as pCp, as described above in Example 1. Alternatively, the 3′ OH groups may be eliminated by alkaline hydrolysis as described in Example 8 below.

EXAMPLE 8 Alkaline Hydrolysis of mRNA

The mRNAs may be treated with alkaline hydrolysis as follows. In a total volume of 100 μl of 0.1N sodium hydroxide, 1.5 μg mRNA is incubated for 40 to 60 minutes at 4° C. The solution is neutralized with acetic acid and precipitated with ethanol.

Following the optional elimination of the 3′ OH groups, the diol groups at the 5′ ends of the mRNAs are oxidized as described below in Example 9.

EXAMPLE 9 Oxidation of Diols

Up to 1 OD unit of RNA was dissolved in 9 μl of buffer (0.1 M sodium acetate, pH 6-7 or water) and 3 μl of freshly prepared 0.1 M sodium periodate solution. The reaction was incubated for 1 h in the dark at 4° C. or room temperature. Following the incubation, the reaction was stopped by adding 4 μl of 10% ethylene glycol. Thereafter the mixture was incubated at room temperature for 15 minutes. After ethanol precipitation, the product was resuspended in 10 μl or more of water or appropriate buffer and dialyzed against water.

Following oxidation of the diol groups at the 5′ ends of the mRNAs, the derivatized oligonucleotide was joined to the resulting aldehydes as described in Example 10.

EXAMPLE 10 Reaction of Aldehydes with Derivatized Oligonucleotides

The oxidized mRNA was dissolved in an acidic medium such as 50 μl of sodium acetate pH 4-6. 50 μl of a solution of the derivatized oligonucleotide was added such that an mRNA:derivatized oligonucleotide ratio of 1:20 was obtained and the mixture was reduced with a borohydride. The mixture was allowed to incubate for 2 h at 37° C. or overnight (14 h) at 10° C. The mixture was ethanol precipitated, resuspended in 10 μl or more of water or appropriate buffer and dialyzed against distilled water. If desired, the resulting product may be analyzed using acrylamide gel electrophoresis, HPLC analysis, or other conventional techniques.
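The amount of derivatized oligonucleotide corresponding to a 1:20 ratio can be estimated as sketched below in Python. This is only a rough estimate under several assumptions that are not stated in the text: the ratio is treated as a molar ratio, the mean mRNA length (2,000 nucleotides) and the input mass (7 μg) are illustrative values, and the average mass per ribonucleotide is approximated as 340 g/mol. All function and constant names are hypothetical.

```python
AVG_NT_MASS_G_PER_MOL = 340.0  # approximate average mass of one ribonucleotide (assumption)


def mrna_pmol(mass_ug, mean_length_nt):
    """Approximate picomoles of mRNA molecules in mass_ug micrograms."""
    grams = mass_ug * 1e-6
    moles = grams / (mean_length_nt * AVG_NT_MASS_G_PER_MOL)
    return moles * 1e12


def oligo_pmol_needed(mass_ug, mean_length_nt, molar_excess=20.0):
    """Picomoles of oligonucleotide for an mRNA:oligonucleotide molar ratio of 1:molar_excess."""
    return molar_excess * mrna_pmol(mass_ug, mean_length_nt)


if __name__ == "__main__":
    # Illustrative input values only.
    print(f"mRNA: {mrna_pmol(7.0, 2000):.1f} pmol")
    print(f"oligonucleotide for 1:20: {oligo_pmol_needed(7.0, 2000):.0f} pmol")
```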

Following the attachment of the derivatized oligonucleotide to the mRNAs, a reverse transcription reaction may be performed as described in Example 11 below.

EXAMPLE 11 Reverse Transcription of mRNAs

An oligodeoxyribonucleotide was derivatized as follows. 3 OD units of an oligodeoxyribonucleotide of sequence ATCAAGAATTCGCACGAGACCATTA (SEQ ID NO:3) having 5′-OH and 3′-P ends were dissolved in 70 μl of a 1.5 M hydroxybenzotriazole solution, pH 5.3, prepared in dimethylformamide/water (75:25) containing 2 μg of 1-ethyl-3-(3-dimethylaminopropyl)carbodiimide. The mixture was incubated for 2 h 30 min at 22° C. The mixture was then precipitated twice in LiClO₄/acetone. The pellet was resuspended in 200 μl of 0.25 M hydrazine and incubated at 8° C. from 3 to 14 h. Following the hydrazine reaction, the mixture was precipitated twice in LiClO₄/acetone.

The messenger RNAs to be reverse transcribed were extracted from blocks of placenta having sides of 2 cm which had been stored at −80° C. The mRNA was extracted using conventional acidic phenol techniques. Oligo-dT chromatography was used to purify the mRNAs. The integrity of the mRNAs was checked by Northern blotting.

The diol groups on 7 μg of the placental mRNAs were oxidized as described above in Example 9. The derivatized oligonucleotide was joined to the mRNAs as described in Example 10 above except that the precipitation step was replaced by an exclusion chromatography step to remove derivatized oligodeoxyribonucleotides which were not joined to mRNAs. Exclusion chromatography was performed as follows:

10 ml of ACA34 GEL (BioSepra #230151) were equilibrated in 50 ml of a solution of 10 mM Tris pH 8.0, 300 mM NaCl, 1 mM EDTA, and 0.05% SDS. The mixture was allowed to sediment. The supernatant was eliminated and the gel was resuspended in 50 ml of buffer. This procedure was repeated 2 or 3 times.

A glass bead (diameter 3 mm) was introduced into a 2 ml disposable pipette (length 25 cm). The pipette was filled with the gel suspension until the height of the gel stabilized at 1 cm from the top of the pipette. The column was then equilibrated with 20 ml of equilibration buffer (10 mM Tris HCl pH 7.4, 20 mM NaCl).

10 μl of the mRNA which had been reacted with the derivatized oligonucleotide were mixed in 39 μl of 10 mM urea and 2 μl of blue-glycerol buffer, which had been prepared by dissolving 5 mg of bromophenol blue in 60% glycerol (v/v), and passing the mixture through a filter with a pore diameter of 0.45 μm.

The column was loaded. As soon as the sample had penetrated, equilibration buffer was added. 100 μl fractions were collected. Derivatized oligonucleotide which had not been attached to mRNA appeared in fraction 16 and later fractions. Fractions 3 to 15 were combined and precipitated with ethanol.

The mRNAs which had been reacted with the derivatized oligonucleotide were spotted on a nylon membrane and hybridized to a radioactive probe using conventional techniques. The radioactive probe used in these hybridizations was an oligodeoxyribonucleotide of sequence TAATGGTCTCGTGCGAATTCTTGAT (SEQ ID NO:4) which was complementary to the derivatized oligonucleotide and was labeled at its 5′ end with ³²P. 1/10th of the mRNAs which had been reacted with the derivatized oligonucleotide was spotted in two spots on the membrane and the membrane was visualized by autoradiography after hybridization of the probe. A signal was observed, indicating that the derivatized oligonucleotide had been joined to the mRNA.
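The relationship between the probe and the derivatized oligonucleotide can be confirmed computationally. The Python sketch below, in which the function and variable names are illustrative assumptions, checks that SEQ ID NO:4 is the reverse complement of SEQ ID NO:3 and locates the EcoRI recognition sequence (GAATTC) in the derivatized oligonucleotide.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return dna.upper().translate(COMPLEMENT)[::-1]


DERIVATIZED_OLIGO = "ATCAAGAATTCGCACGAGACCATTA"  # SEQ ID NO:3
PROBE = "TAATGGTCTCGTGCGAATTCTTGAT"              # SEQ ID NO:4

if __name__ == "__main__":
    print("probe is reverse complement:", reverse_complement(DERIVATIZED_OLIGO) == PROBE)
    # 0-based index of the EcoRI recognition sequence in the oligonucleotide.
    print("EcoRI site starts at index:", DERIVATIZED_OLIGO.find("GAATTC"))
```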

The remaining 9/10 of the mRNAs which had been reacted with the derivatized oligonucleotide was reverse transcribed as follows. A reverse transcription reaction was carried out with reverse transcriptase following the manufacturer's instructions. To prime the reaction, 50 pmol of nonamers with random sequence were used.

A portion of the resulting cDNA was spotted on a positively charged nylon membrane using conventional methods. The cDNAs were spotted on the membrane after the cDNA:RNA heteroduplexes had been subjected to an alkaline hydrolysis in order to eliminate the RNAs. An oligonucleotide having a sequence identical to that of the derivatized oligonucleotide was labeled at its 5′ end with ³²P and hybridized to the cDNA blots using conventional techniques. Single-stranded cDNAs resulting from the reverse transcription reaction were spotted on the membrane. As controls, the blot contained 1 pmol, 100 fmol, 50 fmol, 10 fmol and 1 fmol respectively of a control oligodeoxyribonucleotide of sequence identical to that of the derivatized oligonucleotide. The signal observed in the spots containing the cDNA indicated that approximately 15 fmol of the derivatized oligonucleotide had been reverse transcribed.

These results demonstrate that the reverse transcription can be performed through the cap and, in particular, that reverse transcriptase crosses the 5′-P-P-P-5′ bond of the cap of eukaryotic messenger RNAs.

The single stranded cDNAs obtained after the above first strand synthesis were used as template for PCR reactions. Two types of reactions were carried out. First, specific amplification of the mRNAs for the alpha globin, dehydrogenase, pp15 and elongation factor E4 were carried out using the following pairs of oligodeoxyribonucleotide primers.

alpha-globin
GLO-S (SEQ ID NO:5): CCG ACA AGA CCA ACG TCA AGG CCG C
GLO-As (SEQ ID NO:6): TCA CCA GCA GGC AGT GGC TTA GGA G

3′dehydrogenase
3 DH-S (SEQ ID NO:7): AGT GAT TCC TGC TAC TTT GGA TGG C
3 DH-As (SEQ ID NO:8): GCT TGG TCT TGT TCT GGA GTT TAG A

pp15
PP15-S (SEQ ID NO:9): TCC AGA ATG GGA GAC AAG CCA ATT T
PP15-As (SEQ ID NO:10): AGG GAG GAG GAA ACA GCG TGA GTC C

Elongation factor E4
EF1A-S (SEQ ID NO:11): ATG GGA AAG GAA AAG ACT CAT ATC A
EF1A-As (SEQ ID NO:12): AGC AGC AAC AAT CAG GAC AGC ACA G

Non specific amplifications were also carried out with the antisense (_As) oligodeoxyribonucleotides of the pairs described above and a primer chosen from the sequence of the derivatized oligodeoxyribonucleotide (ATCAAGAATTCGCACGAGACCATTA) (SEQ ID NO:13).

A 1.5% agarose gel containing the following samples corresponding to the PCR products of reverse transcription was stained with ethidium bromide. (1/20th of the products of reverse transcription were used for each PCR reaction).

Sample 1: The products of a PCR reaction using the globin primers of SEQ ID NOs 5 and 6 in the presence of cDNA.

Sample 2: The products of a PCR reaction using the globin primers of SEQ ID NOs 5 and 6 in the absence of added cDNA.

Sample 3: The products of a PCR reaction using the dehydrogenase primers of SEQ ID NOs 7 and 8 in the presence of cDNA.

Sample 4: The products of a PCR reaction using the dehydrogenase primers of SEQ ID NOs 7 and 8 in the absence of added cDNA.

Sample 5: The products of a PCR reaction using the pp15 primers of SEQ ID NOs 9 and 10 in the presence of cDNA.

Sample 6: The products of a PCR reaction using the pp15 primers of SEQ ID NOs 9 and 10 in the absence of added cDNA.

Sample 7: The products of a PCR reaction using the EIE4 primers of SEQ ID NOs 11 and 12 in the presence of added cDNA.

Sample 8: The products of a PCR reaction using the EIE4 primers of SEQ ID NOs 11 and 12 in the absence of added cDNA.

In Samples 1, 3, 5 and 7, a band of the size expected for the PCR product was observed, indicating the presence of the corresponding sequence in the cDNA population.

PCR reactions were also carried out with the antisense oligonucleotides of the globin and dehydrogenase primers (SEQ ID NOs 6 and 8) and an oligonucleotide whose sequence corresponds to that of the derivatized oligonucleotide. The presence of PCR products of the expected size in the samples corresponding to samples 1 and 3 above indicated that the derivatized oligonucleotide had been incorporated.

The above examples summarize the chemical procedure for enriching mRNAs for those having intact 5′ ends. Further details regarding the chemical approaches for obtaining mRNAs having intact 5′ ends are disclosed in International Application No. WO96/34981, published Nov. 7, 1996, which is incorporated herein by reference.

Strategies based on the above chemical modifications to the 5′ cap structure may be utilized to generate cDNAs which have been selected to include the 5′ ends of the mRNAs from which they are derived. In one version of such procedures, the 5′ ends of the mRNAs are modified as described above. Thereafter, a reverse transcription reaction is conducted to extend a primer complementary to the mRNA to the 5′ end of the mRNA. Single stranded RNAs are eliminated to obtain a population of cDNA/mRNA heteroduplexes in which the mRNA includes an intact 5′ end. The resulting heteroduplexes may be captured on a solid phase coated with a molecule capable of interacting with the molecule used to derivatize the 5′ end of the mRNA. Thereafter, the strands of the heteroduplexes are separated to recover single stranded first cDNA strands which include the 5′ end of the mRNA. Second strand cDNA synthesis may then proceed using conventional techniques. For example, the procedures disclosed in WO 96/34981 or in Carninci, P. et al. High-Efficiency Full-Length cDNA Cloning by Biotinylated CAP Trapper. Genomics 37:327-336 (1996), the disclosures of which are incorporated herein by reference, may be employed to select cDNAs which include the sequence derived from the 5′ end of the coding sequence of the mRNA.

Following ligation of the oligonucleotide tag to the 5′ cap of the mRNA, a reverse transcription reaction is conducted to extend a primer complementary to the mRNA to the 5′ end of the mRNA. Following elimination of the RNA component of the resulting heteroduplex using standard techniques, second strand cDNA synthesis is conducted with a primer complementary to the oligonucleotide tag.

FIG. 1 summarizes the above procedures for obtaining cDNAs which have been selected to include the 5′ ends of the mRNAs from which they are derived.

B. Enzymatic Methods for Obtaining mRNAs having Intact 5′ Ends

Other techniques for selecting cDNAs extending to the 5′ end of the mRNA from which they are derived are fully enzymatic. Some versions of these techniques are disclosed in Dumas Milne Edwards J. B. (Doctoral Thesis of Paris VI University, Le clonage des ADNc complets: difficultes et perspectives nouvelles. Apports pour l'etude de la regulation de l'expression de la tryptophane hydroxylase de rat, 20 Dec. 1993), EP 0625572 and Kato et al. Construction of a Human Full-Length cDNA Bank. Gene 150:243-250 (1994), the disclosures of which are incorporated herein by reference.

Briefly, in such approaches, isolated mRNA is treated with alkaline phosphatase to remove the phosphate groups present on the 5′ ends of uncapped incomplete mRNAs. Following this procedure, the cap present on full length mRNAs is enzymatically removed with a decapping enzyme such as T4 polynucleotide kinase or tobacco acid pyrophosphatase. An oligonucleotide, which may be either a DNA oligonucleotide or a DNA-RNA hybrid oligonucleotide having RNA at its 3′ end, is then ligated to the phosphate present at the 5′ end of the decapped mRNA using T4 RNA ligase. The oligonucleotide may include a restriction site to facilitate cloning of the cDNAs following their synthesis. Example 12 below describes one enzymatic method based on the doctoral thesis of Dumas.

EXAMPLE 12 Enzymatic Approach for Obtaining 5′ ESTs

Twenty micrograms of PolyA+ RNA were dephosphorylated using Calf Intestinal Phosphatase (Biolabs). After a phenol chloroform extraction, the cap structure of mRNA was hydrolyzed using the Tobacco Acid Pyrophosphatase (purified as described by Shinshi et al., Biochemistry 15: 2185-2190, 1976) and a hemi 5′DNA/RNA-3′ oligonucleotide having an unphosphorylated 5′ end, a stretch of adenosine ribophosphate at the 3′ end, and an EcoRI site near the 5′ end was ligated to the 5′P ends of mRNA using the T4 RNA ligase (Biolabs). Oligonucleotides suitable for use in this procedure are preferably 30-50 bases in length. Oligonucleotides having an unphosphorylated 5′ end may be synthesized by adding a fluorochrome at the 5′ end. The inclusion of a stretch of adenosine ribophosphates at the 3′ end of the oligonucleotide increases ligation efficiency. It will be appreciated that the oligonucleotide may contain cloning sites other than EcoRI.

Following ligation of the oligonucleotide to the phosphate present at the 5′ end of the decapped mRNA, first and second strand cDNA synthesis may be carried out using conventional methods or those specified in EP 0625572 and Kato et al. Construction of a Human Full-Length cDNA Bank. Gene 150:243-250 (1994), and Dumas Milne Edwards, supra, the disclosures of which are incorporated herein by reference. The resulting cDNA may then be ligated into vectors such as those disclosed in Kato et al. Construction of a Human Full-Length cDNA Bank. Gene 150:243-250 (1994) or other nucleic acid vectors known to those skilled in the art using techniques such as those described in Sambrook et al., Molecular Cloning: A Laboratory Manual 2d Ed., Cold Spring Harbor Laboratory Press, 1989, the disclosure of which is incorporated herein by reference.

II. Characterization of 5′ ESTs

The above chemical and enzymatic approaches for enriching mRNAs having intact 5′ ends were employed to obtain 5′ ESTs. First, mRNAs were prepared as described in Example 13 below.

EXAMPLE 13 Preparation of mRNA

Total human RNAs or PolyA+ RNAs derived from 29 different tissues were respectively purchased from LABIMO and CLONTECH and used to generate 44 cDNA libraries as described below. The purchased RNA had been isolated from cells or tissues using acid guanidinium thiocyanate-phenol-chloroform extraction (Chomczynski, P. and Sacchi, N., Analytical Biochemistry 162:156-159, 1987). PolyA+ RNA was isolated from total RNA (LABIMO) by two passes of oligo-dT chromatography, as described by Aviv and Leder (Aviv, H. and Leder, P., Proc. Natl. Acad. Sci. USA 69:1408-1412, 1972), in order to eliminate ribosomal RNA.

The quality and the integrity of the PolyA+ RNAs were checked. Northern blots hybridized with a globin probe were used to confirm that the mRNAs were not degraded. Contamination of the PolyA+ mRNAs by ribosomal sequences was checked using RNA blots and a probe derived from the sequence of the 28S RNA. Preparations of mRNAs with less than 5% of ribosomal RNAs were used in library construction. To avoid constructing libraries with RNAs contaminated by exogenous sequences (prokaryotic or fungal), the presence of bacterial 16S ribosomal sequences or of two highly expressed mRNAs was examined using PCR.

Following preparation of the mRNAs, the chemical and/or enzymatic procedures for enriching mRNAs having intact 5′ ends discussed above were employed to obtain 5′ ESTs from various tissues. In both approaches an oligonucleotide tag was attached to the cap at the 5′ ends of the mRNAs. The oligonucleotide tag had an EcoRI site therein to facilitate later cloning procedures.

Following attachment of the oligonucleotide tag to the mRNA by either the chemical or enzymatic methods, the integrity of the mRNA was examined by performing a Northern blot with 200-500 ng of mRNA using a probe complementary to the oligonucleotide tag.

EXAMPLE 14 cDNA Synthesis Using mRNA Templates Having Intact 5′ Ends

For the mRNAs joined to oligonucleotide tags using both the chemical and enzymatic methods, first strand cDNA synthesis was performed using reverse transcriptase with random nonamers as primers. In order to protect internal EcoRI sites in the cDNA from digestion at later steps in the procedure, methylated dCTP was used for first strand synthesis. After removal of RNA by an alkaline hydrolysis, the first strand of cDNA was precipitated using isopropanol in order to eliminate residual primers.

For both the chemical and the enzymatic methods, the second strand of the cDNA was synthesized with a Klenow fragment using a primer corresponding to the 5′ end of the ligated oligonucleotide described in Example 12. Preferably, the primer is 20-25 bases in length. Methylated dCTP was also used for second strand synthesis in order to protect internal EcoRI sites in the cDNA from digestion during the cloning process.

Following cDNA synthesis, the cDNAs were cloned into pBlueScript as described in Example 15 below.

EXAMPLE 15 Insertion of cDNAs into BlueScript

Following second strand synthesis, the ends of the cDNA were blunted with T4 DNA polymerase (Biolabs) and the cDNA was digested with EcoRI. Since methylated dCTP was used during cDNA synthesis, the EcoRI site present in the tag was the only site which was hemi-methylated. Consequently, only the EcoRI site in the oligonucleotide tag was susceptible to EcoRI digestion. The cDNA was then size fractionated using exclusion chromatography (AcA, Biosepra). Fractions corresponding to cDNAs of more than 150 bp were pooled and ethanol precipitated. The cDNA was directionally cloned into the SmaI and EcoRI ends of the phagemid PBLUESCRIPT vector (Stratagene). The ligation mixture was electroporated into bacteria and propagated under appropriate antibiotic selection.

Clones containing the oligonucleotide tag attached were selected as described in Example 16 below.

EXAMPLE 16 Selection of Clones Having the Oligonucleotide Tag Attached Thereto

The plasmid DNAs containing 5′ EST libraries made as described above were purified (Qiagen). A positive selection of the tagged clones was performed as follows. Briefly, in this selection procedure, the plasmid DNA was converted to single stranded DNA using gene II endonuclease of the phage F1 in combination with an exonuclease (Chang et al., Gene 127:95-8, 1993) such as exonuclease III or T7 gene 6 exonuclease. The resulting single stranded DNA was then purified using paramagnetic beads as described by Fry et al., Biotechniques, 13: 124-131, 1992. In this procedure, the single stranded DNA was hybridized with a biotinylated oligonucleotide having a sequence corresponding to the 3′ end of the oligonucleotide described in Example 13. Preferably, the primer has a length of 20-25 bases. Clones including a sequence complementary to the biotinylated oligonucleotide were captured by incubation with streptavidin coated magnetic beads followed by magnetic selection. After capture of the positive clones, the plasmid DNA was released from the magnetic beads and converted into double stranded DNA using a DNA polymerase such as the ThermoSequenase obtained from Amersham Pharmacia Biotech. Alternatively, protocols such as the Gene Trapper kit (Gibco BRL) may be used. The double stranded DNA was then electroporated into bacteria. The percentage of positive clones having the 5′ tag oligonucleotide was estimated, using dot blot analysis, to typically range between 90 and 98%.

Following electroporation, the libraries were ordered in 384-well microtiter plates (MTP). A copy of the MTP was stored for future needs. Then the libraries were transferred into 96-well MTP and sequenced as described below.

EXAMPLE 17 Sequencing of Inserts in Selected Clones

Plasmid inserts were first amplified by PCR on PE 9600 thermocyclers (Perkin-Elmer), using standard SETA-A and SETA-B primers (Genset SA), AMPLITAQ GOLD (Perkin-Elmer), dNTPs (Boehringer), buffer and cycling conditions as recommended by the Perkin-Elmer Corporation.

PCR products were then sequenced using automatic ABI Prism 377 sequencers (Perkin Elmer, Applied Biosystems Division, Foster City, Calif.). Sequencing reactions were performed using PE 9600 thermocyclers (Perkin Elmer) with standard dye-primer chemistry and THERMOSEQUENASE (Amersham Life Science). The primers used were either T7 or 21M13 (available from Genset SA) as appropriate. The primers were labeled with the JOE, FAM, ROX and TAMRA dyes. The dNTPs and ddNTPs used in the sequencing reactions were purchased from Boehringer. Sequencing buffer, reagent concentrations and cycling conditions were as recommended by Amersham.

Following the sequencing reaction, the samples were precipitated with EtOH, resuspended in formamide loading buffer, and loaded on a standard 4% acrylamide gel. Electrophoresis was performed for 2.5 hours at 3000 V on an ABI 377 sequencer, and the sequence data were collected and analyzed using the ABI Prism DNA Sequencing Analysis Software, version 2.1.2.

The sequence data from the 44 cDNA libraries made as described above were transferred to a proprietary database, where quality control and validation steps were performed. A proprietary base-caller (“Trace”), running on a Unix system, automatically flagged suspect peaks, taking into account the shape of the peaks, the inter-peak resolution, and the noise level. The proprietary base-caller also performed an automatic trimming. Any stretch of 25 or fewer bases having more than 4 suspect peaks was considered unreliable and was discarded. Sequences corresponding to cloning vector or ligation oligonucleotides were automatically removed from the EST sequences. However, the resulting EST sequences may contain 1 to 5 bases belonging to the above mentioned sequences at their 5′ end. If needed, these can easily be removed on a case by case basis.
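The trimming rule stated above (a stretch of 25 or fewer bases with more than 4 suspect peaks is unreliable) can be illustrated with a short sliding-window sketch. This is not the proprietary base-caller: the function name `unreliable_mask`, the per-base boolean input, and the choice to mark every base of an offending window are assumptions made only for illustration.

```python
def unreliable_mask(suspect, window=25, max_suspect=4):
    """Return a per-base list of booleans, True where the base is unreliable.

    `suspect` is a list of booleans, one per called base, True when the
    base-caller flagged the peak as suspect (hypothetical input format).
    A base is marked unreliable when it lies in any window of `window`
    bases that contains more than `max_suspect` suspect peaks.
    """
    n = len(suspect)
    mask = [False] * n
    # Slide a fixed-size window; sequences shorter than the window are
    # checked once as a whole.
    for start in range(max(1, n - window + 1)):
        end = min(n, start + window)
        if sum(suspect[start:end]) > max_suspect:
            for i in range(start, end):
                mask[i] = True
    return mask


# Illustrative read of 30 bases with 5 suspect peaks at the 5' end.
flags = [True] * 5 + [False] * 25
mask = unreliable_mask(flags)
```

How the unreliable bases are then removed (for example, by cutting the read at the first or last marked base) is not specified in the text and is left open here.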

Thereafter, the sequences were transferred to the proprietary NETGENE™ database for further analysis as described below.

Following sequencing as described above, the sequences of the 5′ ESTs were entered in a proprietary database called NETGENE™ for storage and manipulation. It will be appreciated by those skilled in the art that the data could be stored and manipulated on any medium which can be read and accessed by a computer. Computer readable media include magnetically readable media, optically readable media, or electronically readable media. For example, the computer readable media may be a hard disc, a floppy disc, a magnetic tape, CD-ROM, RAM, or ROM as well as other types of media known to those skilled in the art.

In addition, the sequence data may be stored and manipulated in a variety of data processor programs in a variety of formats. For example, the sequence data may be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT, or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE.

The computer readable media on which the sequence information is stored may be in a personal computer, a network, a server or other computer systems known to those skilled in the art. The computer or other system preferably includes the storage media described above, and a processor for accessing and manipulating the sequence data. Once the sequence data has been stored it may be manipulated and searched to locate those stored sequences which contain a desired nucleic acid sequence or which encode a protein having a particular functional domain. For example, the stored sequence information may be compared to other known sequences to identify homologies, motifs implicated in biological function, or structural motifs.

Programs which may be used to search or compare the stored sequences include the MACPATTERN (EMBL), BLAST, and BLAST2 program series (NCBI), basic local alignment search tool programs for nucleotide (BLASTN) and peptide (BLASTX) comparisons (Altschul et al., J. Mol. Biol. 215:403 (1990)) and FASTA (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85:2444 (1988)). The BLAST programs then extend the alignments on the basis of defined match and mismatch criteria.

Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices, and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.

Before searching the cDNAs in the NETGENE™ database for sequence motifs of interest, cDNAs derived from mRNAs which were not of interest were identified and eliminated from further consideration as described in Example 18 below.

EXAMPLE 18 Elimination of Undesired Sequences from Further Consideration

5′ ESTs in the NETGENE™ database which were derived from undesired sequences such as transfer RNAs, ribosomal RNAs, mitochondrial RNAs, procaryotic RNAs, fungal RNAs, Alu sequences, L1 sequences, or repeat sequences were identified using the FASTA and BLASTN programs with the parameters listed in Table II.

To eliminate 5′ ESTs encoding tRNAs from further consideration, the 5′ EST sequences were compared to the sequences of 1190 known tRNAs obtained from EMBL release 38, of which 100 were human. The comparison was performed using FASTA on both strands of the 5′ ESTs. Sequences having more than 80% homology over more than 60 nucleotides were identified as tRNA. Of the 144,341 sequences screened, 26 were identified as tRNAs and eliminated from further consideration.

To eliminate 5′ ESTs encoding rRNAs from further consideration, the 5′ EST sequences were compared to the sequences of 2497 known rRNAs obtained from EMBL release 38, of which 73 were human. The comparison was performed using BLASTN on both strands of the 5′ ESTs with the parameter S=108. Sequences having more than 80% homology over stretches longer than 40 nucleotides were identified as rRNAs. Of the 144,341 sequences screened, 3,312 were identified as rRNAs and eliminated from further consideration.

To eliminate 5′ ESTs encoding mtRNAs from further consideration, the 5′ EST sequences were compared to the sequences of the two known mitochondrial genomes for which the entire genomic sequences are available and all sequences transcribed from these mitochondrial genomes including tRNAs, rRNAs, and mRNAs for a total of 38 sequences. The comparison was performed using BLASTN on both strands of the 5′ ESTs with the parameter S=108. Sequences having more than 80% homology over stretches longer than 40 nucleotides were identified as mtRNAs. Of the 144,341 sequences screened, 6,110 were identified as mtRNAs and eliminated from further consideration.

Sequences which might have resulted from exogenous contaminants were eliminated from further consideration by comparing the 5′ EST sequences to release 46 of the EMBL bacterial and fungal divisions using BLASTN with the parameter S=144. All sequences having more than 90% homology over at least 40 nucleotides were identified as exogenous contaminants. Of the 42 cDNA libraries examined, the average percentages of procaryotic and fungal sequences contained therein were 0.2% and 0.5% respectively. Among these sequences, only one could be identified as a sequence specific to fungi. The others were either fungal or procaryotic sequences having homologies with vertebrate sequences or including repeat sequences which had not been masked during the electronic comparison.
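The homology and length cutoffs given above for tRNA, rRNA, mtRNA, and exogenous sequences can be collected in one small filter. The sketch below is illustrative only: it treats "homology" as percent identity over the aligned stretch, and the names `THRESHOLDS` and `is_undesired` are hypothetical. It does not run FASTA or BLASTN; it only applies the cutoffs to an alignment that has already been computed.

```python
# category -> (identity cutoff in %, length cutoff in nt, length cutoff inclusive?)
# Values are taken from the text above; "at least 40" is inclusive,
# "more than"/"longer than" are exclusive.
THRESHOLDS = {
    "tRNA": (80.0, 60, False),
    "rRNA": (80.0, 40, False),
    "mtRNA": (80.0, 40, False),
    "exogenous": (90.0, 40, True),
}


def is_undesired(category, percent_identity, aligned_length):
    """Return True when an alignment meets the elimination criteria."""
    min_identity, min_length, inclusive = THRESHOLDS[category]
    if inclusive:
        long_enough = aligned_length >= min_length
    else:
        long_enough = aligned_length > min_length
    # Identity must be strictly greater than the cutoff in every category.
    return percent_identity > min_identity and long_enough


# An 85% identical hit over 61 nt against a tRNA would be eliminated.
flag = is_undesired("tRNA", 85.0, 61)
```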

In addition, the 5′ ESTs were compared to 6093 Alu sequences and 1115 L1 sequences to mask 5′ ESTs containing such repeat sequences from further consideration. 5′ ESTs including THE and MER repeats, SSTR sequences or satellite, micro-satellite, or telomeric repeats were also eliminated from further consideration. On average, 11.5% of the sequences in the libraries contained repeat sequences. Of this 11.5%, 7% contained Alu repeats, 3.3% contained L1 repeats and the remaining 1.2% were derived from the other types of repetitive sequences which were screened. These percentages are consistent with those found in cDNA libraries prepared by other groups. For example, the cDNA libraries of Adams et al. contained between 0% and 7.4% Alu repeats depending on the source of the RNA which was used to prepare the cDNA library (Adams et al., Nature 377:174, 1996).

The sequences of those 5′ ESTs remaining after the elimination of undesirable sequences were compared with the sequences of known human mRNAs to determine the accuracy of the sequencing procedures described above.

EXAMPLE 19 Measurement of Sequencing Accuracy by Comparison to Known Sequences

To further determine the accuracy of the sequencing procedure described above, the sequences of 5′ ESTs derived from known sequences were identified and compared to the known sequences. First, a FASTA analysis with overhangs shorter than 5 bp on both ends was conducted on the 5′ ESTs to identify those matching an entry in the public human mRNA database. The 6655 5′ ESTs which matched a known human mRNA were then realigned with their cognate mRNA and dynamic programming was used to include substitutions, insertions, and deletions in the list of “errors” which would be recognized. Errors occurring in the last 10 bases of the 5′ EST sequences were ignored to avoid the inclusion of spurious cloning sites in the analysis of sequencing accuracy.

This analysis revealed that the sequences incorporated in the NETGENE™ database had an accuracy of more than 99.5%.
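As an illustration of the kind of dynamic programming described in Example 19, the sketch below counts substitutions, insertions, and deletions between a 5′ EST and the matching stretch of its cognate mRNA, ignoring the last 10 bases of the EST. It is a minimal edit-distance sketch under stated assumptions, not the procedure actually used: the reference is assumed to start at the same position as the EST, the end of the reference is left free, and accuracy is taken as 1 minus errors per compared base. The names `count_errors` and `sequencing_accuracy` are hypothetical.

```python
def count_errors(est, reference, ignore_tail=10):
    """Return (errors, compared_bases) for one EST against its reference.

    Errors are the minimum number of substitutions, insertions, and
    deletions needed to turn the EST (minus its last `ignore_tail` bases)
    into some prefix of `reference`.
    """
    query = est[:-ignore_tail] if ignore_tail else est
    # prev[j] = edit distance between the query prefix seen so far
    # and the first j bases of the reference.
    prev = list(range(len(reference) + 1))
    for i, q in enumerate(query, start=1):
        curr = [i]
        for j, r in enumerate(reference, start=1):
            curr.append(min(prev[j] + 1,              # extra base in the EST
                            curr[j - 1] + 1,          # base missing from the EST
                            prev[j - 1] + (q != r)))  # match or substitution
        prev = curr
    # Free end on the reference: take the best-scoring prefix.
    return min(prev), len(query)


def sequencing_accuracy(pairs):
    """Pooled accuracy over (est, reference) pairs, as a fraction."""
    errors = total = 0
    for est, reference in pairs:
        e, n = count_errors(est, reference)
        errors += e
        total += n
    return 1.0 - errors / total if total else float("nan")


# Toy sequences, for illustration only.
ref = "ACGT" * 5 + "T" * 10
clean = "ACGT" * 5 + "N" * 10                   # tail is ignored
one_sub = "ACGTACGA" + "ACGT" * 3 + "N" * 10    # one substitution at base 8
acc = sequencing_accuracy([(clean, ref), (one_sub, ref)])
```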

To determine the efficiency with which the above selection procedures select cDNAs which include the 5′ ends of their corresponding mRNAs, the following analysis was performed.

EXAMPLE 20 Determination of Efficiency of 5′ EST Selection

To determine the efficiency at which the above selection procedures isolated 5′ ESTs which included sequences close to the 5′ end of the mRNAs from which they were derived, the sequences of the ends of the 5′ ESTs which were derived from the elongation factor 1 subunit α and ferritin heavy chain genes were compared to the known cDNA sequences for these genes. Since the transcription start sites for the elongation factor 1 subunit α and ferritin heavy chain are well characterized, they may be used to determine the percentage of 5′ ESTs derived from these genes which included the authentic transcription start sites.

For both genes, more than 95% of the cDNAs included sequences close to or upstream of the 5′ end of the corresponding mRNAs.

To extend the analysis of the reliability of the procedures for isolating 5′ ESTs from ESTs in the NETGENE™ database, a similar analysis was conducted using a database composed of human mRNA sequences extracted from GENBANK database release 97 for comparison. For those 5′ ESTs derived from mRNAs included in the GENBANK database, more than 85% had their 5′ ends close to the 5′ ends of the known sequence. As some of the mRNA sequences available in the GENBANK database are deduced from genomic sequences, a 5′ end matching with these sequences will be counted as an internal match. Thus, the method used here underestimates the yield of ESTs including the authentic 5′ ends of their corresponding mRNAs.

The EST libraries made above included multiple 5′ ESTs derived from the same mRNA. The sequences of such 5′ ESTs were compared to one another and the longest 5′ ESTs for each mRNA were identified. Overlapping cDNAs were assembled into continuous sequences (contigs). The resulting continuous sequences were then compared to public databases to gauge their similarity to known sequences, as described in Example 21 below.

EXAMPLE 21 Clustering of the 5′ ESTs and Calculation of Novelty Indices for cDNA Libraries

For each sequenced EST library, the sequences were clustered by the 5′ end. Each sequence in the library was compared to the others with BLASTN2 (direct strand, parameters S=107). ESTs with High Scoring Segment Pairs (HSPs) at least 25 bp long, having 95% identical bases and beginning closer than 10 bp from each EST 5′ end were grouped. The longest sequence found in the cluster was used as representative of the cluster. A global clustering between libraries was then performed leading to the definition of super-contigs.
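The grouping rule of Example 21 can be sketched with a union-find structure over pairwise HSPs. The sketch assumes the HSPs have already been computed by an alignment program and are supplied as tuples; the tuple layout, the reading of "95% identical bases" as "at least 95%", the 0-based start offsets, and the name `cluster_ests` are all assumptions made for illustration.

```python
def cluster_ests(sequences, hsps, min_len=25, min_identity=95.0, max_offset=10):
    """Group ESTs linked by qualifying HSPs near their 5' ends.

    sequences: dict mapping EST name -> sequence string.
    hsps: iterable of (name_a, name_b, start_a, start_b, length, identity),
          where start_* is the 0-based distance of the HSP from each EST's
          5' end and identity is a percentage (hypothetical layout).
    Returns a dict mapping each cluster's representative (its longest
    member) to the sorted list of member names.
    """
    parent = {name: name for name in sequences}

    def find(x):
        # Follow parent links to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, start_a, start_b, length, identity in hsps:
        if (length >= min_len and identity >= min_identity
                and start_a < max_offset and start_b < max_offset):
            parent[find(a)] = find(b)  # merge the two clusters

    clusters = {}
    for name in sequences:
        clusters.setdefault(find(name), []).append(name)
    return {max(members, key=lambda m: len(sequences[m])): sorted(members)
            for members in clusters.values()}


# Toy data: e1 and e2 share a 5'-anchored HSP; the e1/e3 HSP starts
# 40 bases into e1 and so does not qualify.
seqs = {"e1": "A" * 100, "e2": "A" * 150, "e3": "C" * 80}
pairs = [("e1", "e2", 2, 0, 60, 98.0), ("e1", "e3", 40, 3, 30, 99.0)]
groups = cluster_ests(seqs, pairs)
```

Union-find is used here because grouping is transitive: if A groups with B and B with C, all three fall in one cluster even when A and C share no qualifying HSP.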

To assess the yield of new sequences within the EST libraries, a novelty rate (NR) was defined as: NR = 100 × (Number of new unique sequences found in the library / Total number of sequences from the library). Typically, novelty rates range between 10% and 41% depending on the tissue from which the EST library was obtained. For most of the libraries, the random sequencing of 5′ EST libraries was pursued until the novelty rate reached 20%.
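The novelty rate defined above is a simple percentage, shown here as a short sketch. The function names are hypothetical, and the stopping rule reads "until the novelty rate reached 20%" as "keep sequencing while the rate is still above 20%", which is an interpretation rather than something the text states.

```python
def novelty_rate(new_unique_sequences, total_sequences):
    """NR = 100 x (new unique sequences / total sequences), in percent."""
    if total_sequences <= 0:
        raise ValueError("total_sequences must be positive")
    return 100.0 * new_unique_sequences / total_sequences


def continue_sequencing(rate_percent, stop_at=20.0):
    # Assumption: sequencing of a library stops once the rate has
    # dropped to the 20% level mentioned in the text.
    return rate_percent > stop_at


# Illustrative counts only: 200 new unique sequences out of 1000 reads.
nr = novelty_rate(200, 1000)
```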

Following characterization as described above, the collection of 5′ ESTs in NETGENE™ was screened to identify those 5′ ESTs bearing potential signal sequences as described in Example 22 below.

EXAMPLE 22 Identification of Potential Signal Sequences in 5′ ESTs

The 5′ ESTs in the NETGENE™ database were screened to identify those having an uninterrupted open reading frame (ORF) longer than 45 nucleotides beginning with an ATG codon and extending to the end of the EST. Approximately half of the cDNA sequences in NETGENE™ contained such an ORF. The ORFs of these 5′ ESTs were searched to identify potential signal motifs using slight modifications of the procedures disclosed in Von Heijne, G. A New Method for Predicting Signal Sequence Cleavage Sites. Nucleic Acids Res. 14:4683-4690 (1986), the disclosure of which is incorporated herein by reference. Those 5′ EST sequences encoding a 15 amino acid long stretch with a score of at least 3.5 in the Von Heijne signal peptide identification matrix were considered to possess a signal sequence. Those 5′ ESTs which matched a known human mRNA or EST sequence and had a 5′ end more than 20 nucleotides downstream of the known 5′ end were excluded from further analysis. The remaining cDNAs having signal sequences therein were included in a database called SIGNALTAG™.
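The first screening step of Example 22, finding an ORF that begins with ATG, is longer than 45 nucleotides, and runs uninterrupted to the end of the EST, can be sketched as follows. This is an illustrative sketch only: it scans the given strand in all three frames, measures ORF length from the ATG to the last base of the EST, and does not perform the Von Heijne matrix scoring. The name `find_open_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_open_orf(est, min_length=45):
    """Return the 0-based position of the first qualifying ATG, or None.

    A qualifying ATG starts a reading frame that contains no in-frame
    stop codon before the end of the EST and spans more than
    `min_length` nucleotides.
    """
    seq = est.upper()
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        if len(seq) - start <= min_length:
            continue  # ORF would not be longer than min_length
        # Complete codons from the ATG to the end of the EST.
        codons = (seq[i:i + 3] for i in range(start, len(seq) - 2, 3))
        if not any(codon in STOP_CODONS for codon in codons):
            return start
    return None


# Toy EST: 2-base leader, ATG, fifteen GCT codons, one trailing base.
position = find_open_orf("CC" + "ATG" + "GCT" * 15 + "G")
```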

To confirm the accuracy of the above method for identifying signal sequences, the analysis of Example 23 was performed.

EXAMPLE 23 Confirmation of Accuracy of Identification of Potential Signal Sequences in 5′ ESTs

The accuracy of the above procedure for identifying signal sequences encoding signal peptides was evaluated by applying the method to the 43 amino terminal amino acids of all human SwissProt proteins. The computed Von Heijne score for each protein was compared with the known characterization of the protein as being a secreted protein or a non-secreted protein. In this manner, the number of non-secreted proteins having a score higher than 3.5 (false positives) and the number of secreted proteins having a score lower than 3.5 (false negatives) could be calculated.

Using the results of the above analysis, the probability that a peptide encoded by the 5′ region of the mRNA is in fact a genuine signal peptide based on its Von Heijne's score was calculated based on either the assumption that 10% of human proteins are secreted or the assumption that 20% of human proteins are secreted. The results of this analysis are shown in FIGS. 2 and 3.
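One way to turn false-positive and false-negative counts into the probability described above is Bayes' rule with the assumed fraction of secreted proteins as the prior. The sketch below shows that calculation at a single score threshold; whether the analysis behind FIGS. 2 and 3 was carried out exactly this way is not stated, and the counts used in the example are hypothetical, not results from the SwissProt analysis.

```python
def signal_peptide_probability(true_pos, false_neg, false_pos, true_neg, prior):
    """P(genuine signal peptide | score at or above the threshold).

    true_pos / false_neg: secreted proteins scoring above / below the threshold.
    false_pos / true_neg: non-secreted proteins scoring above / below it.
    prior: assumed fraction of human proteins that are secreted (0.10 or 0.20).
    """
    sensitivity = true_pos / (true_pos + false_neg)
    false_positive_rate = false_pos / (false_pos + true_neg)
    secreted_and_flagged = sensitivity * prior
    non_secreted_and_flagged = false_positive_rate * (1.0 - prior)
    return secreted_and_flagged / (secreted_and_flagged + non_secreted_and_flagged)


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
p10 = signal_peptide_probability(90, 10, 5, 95, prior=0.10)
p20 = signal_peptide_probability(90, 10, 5, 95, prior=0.20)
```

With the same counts, the larger prior gives the larger probability, which is why the text reports results under both the 10% and the 20% assumption.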

Using the above method of identifying secretory proteins, 5′ ESTs for human glucagon, gamma interferon induced monokine precursor, secreted cyclophilin-like protein, human pleiotrophin, and human biotinidase precursor, all of which are polypeptides which are known to be secreted, were obtained. Thus, the above method successfully identified those 5′ ESTs which encode a signal peptide.

To confirm that the signal peptide encoded by the 5′ ESTs actually functions as a signal peptide, the signal sequences from the 5′ ESTs may be cloned into a vector designed for the identification of signal peptides. Some signal peptide identification vectors are designed to confer the ability to grow in selective medium on host cells which have a signal sequence operably inserted into the vector. For example, to confirm that a 5′ EST encodes a genuine signal peptide, the signal sequence of the 5′ EST may be inserted upstream and in frame with a non-secreted form of the yeast invertase gene in signal peptide selection vectors such as those described in U.S. Pat. No. 5,536,637, the disclosure of which is incorporated herein by reference. Growth of host cells containing signal sequence selection vectors having the signal sequence from the 5′ EST inserted therein confirms that the 5′ EST encodes a genuine signal peptide.

Alternatively, the presence of a signal peptide may be confirmed by cloning the extended cDNAs obtained using the ESTs into expression vectors such as pXT1 (as described below), or by constructing promoter-signal sequence-reporter gene vectors which encode fusion proteins between the signal peptide and an assayable reporter protein. After introduction of these vectors into a suitable host cell, such as COS cells or NIH 3T3 cells, the growth medium may be harvested and analyzed for the presence of the secreted protein. The medium from these cells is compared to the medium from cells containing vectors lacking the signal sequence or extended cDNA insert to identify vectors which encode a functional signal peptide or an authentic secreted protein.

Those 5′ ESTs which encoded a signal peptide, as determined by the method of Example 22 above, were further grouped into four categories based on their homology to known sequences. The categorization of the 5′ ESTs is described in Example 24 below.

EXAMPLE 24 Categorization of 5′ ESTs Encoding a Signal Peptide

Those 5′ ESTs having a sequence not matching any known vertebrate sequence nor any publicly available EST sequence were designated “new.” Of the sequences in the SIGNALTAG™ database, 947 of the 5′ ESTs having a Von Heijne's score of at least 3.5 fell into this category.

Those 5′ ESTs having a sequence not matching any vertebrate sequence but matching a publicly known EST were designated “EST-ext”, provided that the known EST sequence was extended by at least 40 nucleotides in the 5′ direction. Of the sequences in the SIGNALTAG™ database, 150 of the 5′ ESTs having a Von Heijne's score of at least 3.5 fell into this category.

Those ESTs not matching any vertebrate sequence but matching a publicly known EST without extending the known EST by at least 40 nucleotides in the 5′ direction were designated “EST.” Of the sequences in the SIGNALTAG™ database, 599 of the 5′ ESTs having a Von Heijne's score of at least 3.5 fell into this category.

Those 5′ ESTs matching a human mRNA sequence but extending the known sequence by at least 40 nucleotides in the 5′ direction were designated “VERT-ext.” Of the sequences in the SIGNALTAG™ database, 23 of the 5′ ESTs having a Von Heijne's score of at least 3.5 fell into this category. Included in this category was a 5′ EST which extended the known sequence of the human translocase mRNA by more than 200 bases in the 5′ direction. A 5′ EST which extended the sequence of a human tumor suppressor gene in the 5′ direction was also identified.
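The four categories of Example 24 amount to a small decision rule, sketched below. The function name `categorize` and its boolean inputs are hypothetical, and the sketch assumes the database comparisons have already been made. The text speaks of "vertebrate" sequences for three categories and of a "human mRNA" for VERT-ext; the sketch uses a single `matches_known_mrna` flag for both, which is a simplification. A 5′ EST that matches a known mRNA without extending it by at least 40 nucleotides falls in none of the four categories and returns None.

```python
def categorize(matches_known_mrna, matches_public_est, extension_nt=0,
               min_extension=40):
    """Assign a 5' EST to "new", "EST-ext", "EST", "VERT-ext", or None.

    extension_nt: number of nucleotides by which the 5' EST extends the
    matched known sequence in the 5' direction (0 when there is no match).
    """
    if matches_known_mrna:
        return "VERT-ext" if extension_nt >= min_extension else None
    if matches_public_est:
        return "EST-ext" if extension_nt >= min_extension else "EST"
    return "new"


# A 5' EST extending a public EST by 55 nucleotides is "EST-ext".
label = categorize(matches_known_mrna=False, matches_public_est=True,
                   extension_nt=55)
```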

FIG. 4 shows the distribution of 5′ ESTs in each category and the number of 5′ ESTs in each category having a given minimum von Heijne's score.

Each of the 5′ ESTs was categorized based on the tissue from which its corresponding mRNA was obtained, as described below in Example 25.

EXAMPLE 25 Categorization of Expression Patterns

FIG. 5 shows the tissues from which the mRNAs corresponding to the 5′ ESTs in each of the above described categories were obtained.

In addition to categorizing the 5′ ESTs by the tissue from which the cDNA library in which they were first identified was obtained, the spatial and temporal expression patterns of the mRNAs corresponding to the 5′ ESTs, as well as their expression levels, may be determined as described in Example 26 below. Characterization of the spatial and temporal expression patterns and expression levels of these mRNAs is useful for constructing expression vectors capable of producing a desired level of gene product in a desired spatial or temporal manner, as will be discussed in more detail below.

In addition, 5′ ESTs whose corresponding mRNAs are associated with disease states may also be identified. For example, a particular disease may result from lack of expression, over expression, or under expression of an mRNA corresponding to a 5′ EST. By comparing mRNA expression patterns and quantities in samples taken from healthy individuals with those from individuals suffering from a particular disease, 5′ ESTs responsible for the disease may be identified.

It will be appreciated that the results of the above characterization procedures for 5′ ESTs also apply to extended cDNAs (obtainable as described below) which contain sequences adjacent to the 5′ ESTs. It will also be appreciated that if it is desired to defer characterization until extended cDNAs have been obtained rather than characterizing the ESTs themselves, the above characterization procedures can be applied to characterize the extended cDNAs after their isolation.

EXAMPLE 26 Evaluation of Expression Levels and Patterns of mRNAs Corresponding to 5′ ESTs or Extended cDNAs

Expression levels and patterns of mRNAs corresponding to 5′ ESTs or extended cDNAs (obtainable as described below) may be analyzed by solution hybridization with long probes as described in International Patent Application No. WO 97/05277, the entire contents of which are hereby incorporated by reference. Briefly, a 5′ EST, extended cDNA, or fragment thereof corresponding to the gene encoding the mRNA to be characterized is inserted at a cloning site immediately downstream of a bacteriophage (T3, T7 or SP6) RNA polymerase promoter to produce antisense RNA. Preferably, the 5′ EST or extended cDNA has 100 or more nucleotides. The plasmid is linearized and transcribed in the presence of ribonucleotides comprising modified ribonucleotides (i.e. biotin-UTP and DIG-UTP). An excess of this doubly labeled RNA is hybridized in solution with mRNA isolated from cells or tissues of interest. The hybridizations are performed under standard stringent conditions (40-50° C. for 16 hours in an 80% formamide, 0.4 M NaCl buffer, pH 7-8). The unhybridized probe is removed by digestion with ribonucleases specific for single-stranded RNA (i.e. RNases CL3, T1, Phy M, U2 or A). The presence of the biotin-UTP modification enables capture of the hybrid on a microtitration plate coated with streptavidin. The presence of the DIG modification enables the hybrid to be detected and quantified by ELISA using an anti-DIG antibody coupled to alkaline phosphatase.

The 5′ ESTs, extended cDNAs, or fragments thereof may also be tagged with nucleotide sequences for the serial analysis of gene expression (SAGE) as disclosed in UK Patent Application No. 2 305 241 A, the entire contents of which are incorporated by reference. In this method, cDNAs are prepared from a cell, tissue, organism or other source of nucleic acid for which it is desired to determine gene expression patterns. The resulting cDNAs are separated into two pools. The cDNAs in each pool are cleaved with a first restriction endonuclease, called an “anchoring enzyme,” having a recognition site which is likely to be present at least once in most cDNAs. The fragments which contain the 5′ or 3′ most region of the cleaved cDNA are isolated by binding to a capture medium such as streptavidin coated beads. A first oligonucleotide linker having a first sequence for hybridization of an amplification primer and an internal restriction site for a “tagging endonuclease” is ligated to the digested cDNAs in the first pool. Digestion with the second endonuclease produces short “tag” fragments from the cDNAs.

A second oligonucleotide having a second sequence for hybridization of an amplification primer and an internal restriction site is ligated to the digested cDNAs in the second pool. The cDNA fragments in the second pool are also digested with the “tagging endonuclease” to generate short “tag” fragments derived from the cDNAs in the second pool. The “tags” resulting from digestion of the first and second pools with the anchoring enzyme and the tagging endonuclease are ligated to one another to produce “ditags.” In some embodiments, the ditags are concatamerized to produce ligation products containing from 2 to 200 ditags. The tag sequences are then determined and compared to the sequences of the 5′ ESTs or extended cDNAs to determine which 5′ ESTs or extended cDNAs are expressed in the cell, tissue, organism, or other source of nucleic acids from which the tags were derived. In this way, the expression pattern of the 5′ ESTs or extended cDNAs in the cell, tissue, organism, or other source of nucleic acids is obtained.

Quantitative analysis of gene expression may also be performed using arrays. As used herein, the term array means a one dimensional, two dimensional, or multidimensional arrangement of full length cDNAs (i.e. extended cDNAs which include the coding sequence for the signal peptide, the coding sequence for the mature protein, and a stop codon), extended cDNAs, 5′ ESTs or fragments of the full length cDNAs, extended cDNAs, or 5′ ESTs of sufficient length to permit specific detection of gene expression. Preferably, the fragments are at least 15 nucleotides in length. More preferably, the fragments are at least 100 nucleotides in length. More preferably, the fragments are more than 100 nucleotides in length. In some embodiments the fragments may be more than 500 nucleotides in length.

For example, quantitative analysis of gene expression may be performed with full length cDNAs, extended cDNAs, 5′ ESTs, or fragments thereof in a complementary DNA microarray as described by Schena et al. (Science 270:467-470, 1995; Proc. Natl. Acad. Sci. U.S.A. 93:10614-10619, 1996). Full length cDNAs, extended cDNAs, 5′ ESTs or fragments thereof are amplified by PCR and arrayed from 96-well microtiter plates onto silylated microscope slides using high-speed robotics. Printed arrays are incubated in a humid chamber to allow rehydration of the array elements and rinsed, once in 0.2% SDS for 1 min, twice in water for 1 min and once for 5 min in sodium borohydride solution. The arrays are submerged in water for 2 min at 95° C., transferred into 0.2% SDS for 1 min, rinsed twice with water, air dried and stored in the dark at 25° C.

Cell or tissue mRNA is isolated or commercially obtained and probes are prepared by a single round of reverse transcription. Probes are hybridized to 1 cm² microarrays under a 14×14 mm glass coverslip for 6-12 hours at 60° C. Arrays are washed for 5 min at 25° C. in low stringency wash buffer (1×SSC/0.2% SDS), then for 10 min at room temperature in high stringency wash buffer (0.1×SSC/0.2% SDS). Arrays are scanned in 0.1×SSC using a fluorescence laser scanning device fitted with a custom filter set. Accurate differential expression measurements are obtained by taking the average of the ratios of two independent hybridizations.

Quantitative analysis of the expression of genes may also be performed with full length cDNAs, extended cDNAs, 5′ ESTs, or fragments thereof in complementary DNA arrays as described by Pietu et al. (Genome Research 6:492-503, 1996). The full length cDNAs, extended cDNAs, 5′ ESTs or fragments thereof are PCR amplified and spotted on membranes. Then, mRNAs originating from various tissues or cells are labeled with radioactive nucleotides. After hybridization and washing in controlled conditions, the hybridized mRNAs are detected by phospho-imaging or autoradiography. Duplicate experiments are performed and a quantitative analysis of differentially expressed mRNAs is then performed.

Alternatively, expression analysis of the 5′ ESTs or extended cDNAs canbe done through high density nucleotide arrays as described by Lockhartet al. (Nature Biotechnology 14: 1675-1680, 1996) and Sosnowsky et al.(Proc. Natl. Acad. Sci. 94:1119-1123, 1997). Oligonucleotides of 15-50nucleotides corresponding to sequences of the 5′ ESTs or extended cDNAsare synthesized directly on the chip (Lockhart et al., supra) orsynthesized and then addressed to the chip (Sosnowski et al., supra).Preferably, the oligonucleotides are about 20 nucleotides in length.

cDNA probes labeled with an appropriate compound, such as biotin, digoxigenin or fluorescent dye, are synthesized from the appropriate mRNA population and then randomly fragmented to an average size of 50 to 100 nucleotides. The probes are then hybridized to the chip. After washing as described in Lockhart et al., supra, and application of different electric fields (Sosnowski et al., supra), the dyes or labeling compounds are detected and quantified. Duplicate hybridizations are performed. Comparative analysis of the intensity of the signal originating from cDNA probes on the same target oligonucleotide in different cDNA samples indicates a differential expression of the mRNA corresponding to the 5′ EST or extended cDNA from which the oligonucleotide sequence has been designed.

III. Use of 5′ ESTs to Clone Extended cDNAs and to Clone the Corresponding Genomic DNAs

Once 5′ ESTs which include the 5′ end of the corresponding mRNAs have been selected using the procedures described above, they can be utilized to isolate extended cDNAs which contain sequences adjacent to the 5′ ESTs. The extended cDNAs may include the entire coding sequence of the protein encoded by the corresponding mRNA, including the authentic translation start site, the signal sequence, and the sequence encoding the mature protein remaining after cleavage of the signal peptide. Such extended cDNAs are referred to herein as “full length cDNAs.” Alternatively, the extended cDNAs may include only the sequence encoding the mature protein remaining after cleavage of the signal peptide, or only the sequence encoding the signal peptide.

Example 27 below describes a general method for obtaining extended cDNAs. Example 28 below describes the cloning and sequencing of several extended cDNAs, including extended cDNAs which include the entire coding sequence and authentic 5′ end of the corresponding mRNA for several secreted proteins.

The methods of Examples 27, 28, and 29 can also be used to obtain extended cDNAs which encode less than the entire coding sequence of the secreted proteins encoded by the genes corresponding to the 5′ ESTs. In some embodiments, the extended cDNAs isolated using these methods encode at least 10 amino acids of one of the proteins encoded by the sequences of SEQ ID NOs: 40-84 and 130-154. In further embodiments, the extended cDNAs encode at least 20 amino acids of the proteins encoded by the sequences of SEQ ID NOs: 40-84 and 130-154. In further embodiments, the extended cDNAs encode at least 30 amino acids of the proteins encoded by the sequences of SEQ ID NOs: 40-84 and 130-154. In a preferred embodiment, the extended cDNAs encode a full length protein sequence, which includes the protein coding sequences of SEQ ID NOs: 40-84 and 130-154.

EXAMPLE 27 General Method for Using 5′ ESTs to Clone and Sequence Extended cDNAs

The following general method has been used to quickly and efficiently isolate extended cDNAs including sequence adjacent to the sequences of the 5′ ESTs used to obtain them. This method may be applied to obtain extended cDNAs for any 5′ EST in the NETGENE™ database, including those 5′ ESTs encoding secreted proteins. The method is summarized in FIG. 6.

1. Obtaining Extended cDNAs

a) First Strand Synthesis

The method takes advantage of the known 5′ sequence of the mRNA. A reverse transcription reaction is conducted on purified mRNA with a poly 14dT primer containing a 49 nucleotide sequence at its 5′ end allowing the addition of a known sequence at the end of the cDNA which corresponds to the 3′ end of the mRNA. For example, the primer may have the following sequence: 5′-ATC GTT GAG ACT CGT ACC AGC AGA GTC ACG AGA GAG ACT ACA CGG TAC TGG TTT TTT TTT TTT TTVN-3′ (SEQ ID NO:14). Those skilled in the art will appreciate that other sequences may also be added to the poly dT sequence and used to prime the first strand synthesis. Using this primer and a reverse transcriptase such as the SUPERSCRIPT II (Gibco BRL) or RNASE H MINUS M-MLV (Promega) enzyme, a reverse transcript anchored at the 3′ polyA site of the RNAs is generated.

After removal of the mRNA hybridized to the first cDNA strand by alkaline hydrolysis, the products of the alkaline hydrolysis and the residual poly dT primer are eliminated with an exclusion column such as an ACA34 (Biosepra) matrix as explained in Example 11.

b) Second Strand Synthesis

A pair of nested primers on each end is designed based on the known 5′ sequence from the 5′ EST and the known 3′ end added by the poly dT primer used in the first strand synthesis. The software used to design primers is based either on GC content and melting temperatures of oligonucleotides, such as OSP (Hillier and Green, PCR Meth. Appl. 1:124-128, 1991), or on the octamer frequency disparity method (Griffais et al., Nucleic Acids Res. 19: 3887-3891, 1991), such as PC-Rare (http://bioinformatics.weizmann.ac.il/software/PC-Rare/doc/manuel.html).

Preferably, the nested primers at the 5′ end are separated from one another by four to nine bases. The 5′ primer sequences may be selected to have melting temperatures and specificities suitable for use in PCR.

Preferably, the nested primers at the 3′ end are separated from one another by four to nine bases. For example, the nested 3′ primers may have the following sequences: 5′-CCA GCA GAG TCA CGA GAG AGA CTA CAC GG-3′ (SEQ ID NO:15), and 5′-CAC GAG AGA GAC TAC ACG GTA CTG G-3′ (SEQ ID NO:16). These primers were selected because they have melting temperatures and specificities compatible with their use in PCR. However, those skilled in the art will appreciate that other sequences may also be used as primers.
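By way of illustration only, the relationship between the exemplary nested 3′ primers and the known sequence added by the poly dT primer can be checked with a short script. The sketch below assumes that the sequences are entered as plain strings with the spaces removed; the names ANCHOR_TAG, OUTER_3P, INNER_3P and locate are hypothetical and are not part of the disclosed method.

```python
# Sequences copied from the text with spaces removed.
# ANCHOR_TAG is the portion of SEQ ID NO:14 that precedes the run of T residues.
ANCHOR_TAG = "ATCGTTGAGACTCGTACCAGCAGAGTCACGAGAGAGACTACACGGTACTGG"
OUTER_3P = "CCAGCAGAGTCACGAGAGAGACTACACGG"  # SEQ ID NO:15
INNER_3P = "CACGAGAGAGACTACACGGTACTGG"      # SEQ ID NO:16


def locate(primer: str, tag: str) -> int:
    """Return the 0-based start of primer within tag, or -1 if it is absent."""
    return tag.find(primer)


outer_start = locate(OUTER_3P, ANCHOR_TAG)
inner_start = locate(INNER_3P, ANCHOR_TAG)
outer_end = outer_start + len(OUTER_3P)
inner_end = inner_start + len(INNER_3P)

# Both primers should lie inside the tag, with the inner primer
# positioned closer to the cDNA insert than the outer primer.
print("outer primer spans", outer_start, "to", outer_end)
print("inner primer spans", inner_start, "to", inner_end)
print("offset between 3' ends:", inner_end - outer_end)
```

A primer that is not found is reported with a start of -1, which may indicate a transcription error in the entered sequence.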

The first PCR run of 25 cycles is performed using the ADVANTAGE TTH POLYMERASE MIX (Clontech) and the outer primer from each of the nested pairs. A second 20 cycle PCR using the same enzyme and the inner primer from each of the nested pairs is then performed on 1/2500 of the first PCR product. Thereafter, the primers and nucleotides are removed.

2. Sequencing of Full Length Extended cDNAs or Fragments Thereof

Due to the lack of position constraints on the design of 5′ nested primers compatible for PCR use using the OSP software, amplicons of two types are obtained. Preferably, the second 5′ primer is located upstream of the translation initiation codon thus yielding a nested PCR product containing the whole coding sequence. Such a full length extended cDNA undergoes a direct cloning procedure as described in section a below. However, in some cases, the second 5′ primer is located downstream of the translation initiation codon, thereby yielding a PCR product containing only part of the ORF. Such incomplete PCR products are submitted to a modified procedure described in section b below.

a) Nested PCR Products Containing Complete ORFs

When the resulting nested PCR product contains the complete coding sequence, as predicted from the 5′ EST sequence, it is cloned in an appropriate vector such as pED6dpc2, as described in section 3.

b) Nested PCR Products Containing Incomplete ORFs

When the amplicon does not contain the complete coding sequence, intermediate steps are necessary to obtain both the complete coding sequence and a PCR product containing the full coding sequence. The complete coding sequence can be assembled from several partial sequences determined directly from different PCR products as described in the following section.

Once the full coding sequence has been completely determined, new primers compatible for PCR use are designed to obtain amplicons containing the whole coding region. However, in such cases, 3′ primers compatible for PCR use are located inside the 3′ UTR of the corresponding mRNA, thus yielding amplicons which lack part of this region, i.e. the polyA tract and sometimes the polyadenylation signal, as illustrated in FIG. 6. Such full length extended cDNAs are then cloned into an appropriate vector as described in section 3.

c) Sequencing Extended cDNAs

Sequencing of extended cDNAs can be performed using a Dye Terminator approach with the AMPLITAQ DNA polymerase FS kit available from Perkin Elmer.

In order to sequence PCR fragments, primer walking is performed using software such as OSP to choose primers and automated computer software such as ASMG (Sutton et al., Genome Science Technol. 1: 9-19, 1995) to construct contigs of walking sequences including the initial 5′ tag using minimum overlaps of 32 nucleotides. Preferably, primer walking is performed until the sequences of full length cDNAs are obtained.

Completion of the sequencing of a given extended cDNA fragment is assessed as follows. Since sequences located after a polyA tract are difficult to determine precisely in the case of uncloned products, sequencing and primer walking processes for PCR products are interrupted when a polyA tract is identified in extended cDNAs obtained as described in case b. The sequence length is compared to the size of the nested PCR product obtained as described above. Due to the limited accuracy of the determination of the PCR product size by gel electrophoresis, a sequence is considered complete if the size of the obtained sequence is at least 70% of the size of the first nested PCR product. If the length of the sequence determined from the computer analysis is not at least 70% of the length of the nested PCR product, these PCR products are cloned and the sequence of the insertion is determined. When Northern blot data are available, the size of the mRNA detected for a given PCR product is used to finally assess that the sequence is complete. Sequences which do not fulfill the above criteria are discarded and will undergo a new isolation procedure.
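By way of illustration only, the 70% length criterion described above may be expressed as a simple check. In the sketch below, the function name is_sequence_complete and its parameters are hypothetical, and integer arithmetic is used so that the comparison at the threshold is exact.

```python
def is_sequence_complete(sequence_length: int,
                         pcr_product_size: int,
                         min_percent: int = 70) -> bool:
    """Return True if the assembled sequence covers at least min_percent
    of the nested PCR product size estimated by gel electrophoresis."""
    if pcr_product_size <= 0:
        raise ValueError("pcr_product_size must be positive")
    # Compare in integer percent units to avoid floating point rounding.
    return sequence_length * 100 >= min_percent * pcr_product_size


# A 1400 nt sequence against a 2000 bp product is exactly at the threshold.
print(is_sequence_complete(1400, 2000))
print(is_sequence_complete(1399, 2000))
```

Sequences failing the check would, as described above, be cloned and resequenced rather than accepted.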

Sequence data of all extended cDNAs are then transferred to a proprietary database, where quality controls and validation steps are carried out as described in Example 15.

3. Cloning of Full Length Extended cDNAs

The PCR product containing the full coding sequence is then cloned in an appropriate vector. For example, the extended cDNAs can be cloned into the expression vector pED6dpc2 (DiscoverEase, Genetics Institute, Cambridge, Mass.) as follows. The structure of pED6dpc2 is shown in FIG. 7. pED6dpc2 vector DNA is prepared with blunt ends by performing an EcoRI digestion followed by a fill in reaction. The blunt ended vector is dephosphorylated. After removal of PCR primers and ethanol precipitation, the PCR product containing the full coding sequence or the extended cDNA obtained as described above is phosphorylated with a kinase, which is subsequently removed by phenol-Sevag extraction and precipitation. The double stranded extended cDNA is then ligated to the vector and the resulting expression plasmid is introduced into appropriate host cells.

Since the PCR products obtained as described above are blunt ended molecules that can be cloned in either direction, the orientation of several clones for each PCR product is determined. Then, 4 to 10 clones are ordered in microtiter plates and subjected to a PCR reaction using a first primer located in the vector close to the cloning site and a second primer located in the portion of the extended cDNA corresponding to the 3′ end of the mRNA. This second primer may be the antisense primer used in anchored PCR in the case of direct cloning (case a) or the antisense primer located inside the 3′ UTR in the case of indirect cloning (case b). Clones in which the start codon of the extended cDNA is operably linked to the promoter in the vector so as to permit expression of the protein encoded by the extended cDNA are conserved and sequenced. In addition to the ends of cDNA inserts, approximately 50 bp of vector DNA on each side of the cDNA insert are also sequenced.

The cloned PCR products are then entirely sequenced according to the aforementioned procedure. In this case, contig assembly of long fragments is then performed on walking sequences that have already been contigated for uncloned PCR products during primer walking. Sequencing of cloned amplicons is complete when the resulting contigs include the whole coding region as well as overlapping sequences with vector DNA on both ends.

4. Computer Analysis of Full Length Extended cDNA

Sequences of all full length extended cDNAs may then be subjected to further analysis as described below and using the parameters found in Table II with the following modifications. For screening of miscellaneous subdivisions of Genbank, FASTA was used instead of BLASTN and 15 nucleotides of homology was the limit instead of 17. For Alu detection, BLASTN was used with the following parameters: S=72; identity=70%; and length=40 nucleotides. The polyadenylation signal and polyA tail, which were not searched for in the 5′ ESTs, were searched. For polyadenylation signal detection, the signal (AATAAA) was searched with one permissible mismatch in the last fifty nucleotides preceding the 5′ end of the polyA. For the polyA, a stretch of 8 adenines in the last 20 nucleotides of the sequence was searched with BLAST2N in the sense strand with the following parameters (W=6, S=10, E=1000, and identity=90%). Finally, patented sequences and ORF homologies were searched using, respectively, BLASTN and BLASTP on GenSEQ (Derwent's database of patented nucleotide sequences) and SWISSPROT for ORFs with the following parameters (W=8 and B=10). Before examining the extended full length cDNAs for sequences of interest, extended cDNAs which are not of interest are searched as follows.

a) Elimination of Undesired Sequences

Although 5′ ESTs were checked to remove contaminant sequences as described in Example 18, a last verification was carried out to identify extended cDNA sequences derived from undesired sequences such as vector RNAs, transfer RNAs, ribosomal RNAs, mitochondrial RNAs, prokaryotic RNAs and fungal RNAs using the FASTA and BLASTN programs on both strands of extended cDNAs as described below.

To identify the extended cDNAs encoding vector RNAs, extended cDNAs are compared to the known sequences of vector RNA using the FASTA program. Sequences of extended cDNAs with more than 90% homology over stretches of 15 nucleotides are identified as vector RNA.

To identify the extended cDNAs encoding tRNAs, extended cDNA sequences were compared to the sequences of 1190 known tRNAs obtained from EMBL release 38, of which 100 were human. Sequences of extended cDNAs having more than 80% homology over 60 nucleotides using FASTA were identified as tRNA.

To identify the extended cDNAs encoding rRNAs, extended cDNA sequences were compared to the sequences of 2497 known rRNAs obtained from EMBL release 38, of which 73 were human. Sequences of extended cDNAs having more than 80% homology over stretches longer than 40 nucleotides using BLASTN were identified as rRNAs.

To identify the extended cDNAs encoding mtRNAs, extended cDNA sequences were compared to the sequences of the two known mitochondrial genomes for which the entire genomic sequences are available and all sequences transcribed from these mitochondrial genomes including tRNAs, rRNAs, and mRNAs for a total of 38 sequences. Sequences of extended cDNAs having more than 80% homology over stretches longer than 40 nucleotides using BLASTN were identified as mtRNAs.

Sequences which might have resulted from other exogenous contaminants were identified by comparing extended cDNA sequences to release 105 of Genbank bacterial and fungal divisions. Sequences of extended cDNAs having more than 90% homology over 40 nucleotides using BLASTN were identified as exogenous prokaryotic or fungal contaminants.
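By way of illustration only, the homology and length thresholds recited above for vector, tRNA, rRNA, mtRNA and exogenous contaminants may be collected in a lookup table and applied to the best hit reported by FASTA or BLASTN. The sketch below is hypothetical: the names Rule, RULES and is_contaminant are not part of the disclosed method, and for simplicity every length is treated as a minimum (greater than or equal to), which is an assumption where the text recites stretches "longer than" 40 nucleotides.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    program: str         # search program named in the text
    min_identity: float  # percent homology that must be exceeded
    min_length: int      # match length in nucleotides (assumed inclusive)


# Thresholds transcribed from the paragraphs above.
RULES = {
    "vector": Rule("FASTA", 90.0, 15),
    "tRNA": Rule("FASTA", 80.0, 60),
    "rRNA": Rule("BLASTN", 80.0, 40),
    "mtRNA": Rule("BLASTN", 80.0, 40),
    "exogenous": Rule("BLASTN", 90.0, 40),
}


def is_contaminant(reference_set: str, identity: float, length: int) -> bool:
    """Return True if a hit against reference_set passes its thresholds."""
    rule = RULES[reference_set]
    return identity > rule.min_identity and length >= rule.min_length


print(is_contaminant("tRNA", 85.0, 60))
print(is_contaminant("vector", 95.0, 10))
```

Keeping the thresholds in one table makes it straightforward to adjust a single category without touching the comparison logic.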

In addition, extended cDNAs were searched for different repeat sequences, including Alu sequences, L1 sequences, THE and MER repeats, SSTR sequences or satellite, micro-satellite, or telomeric repeats. Sequences of extended cDNAs with more than 70% homology over 40 nucleotide stretches using BLASTN were identified as repeat sequences and masked in further identification procedures. In addition, clones showing extensive homology to repeats, i.e., matches of either more than 50 nucleotides if the homology was at least 75% or more than 40 nucleotides if the homology was at least 85% or more than 30 nucleotides if the homology was at least 90%, were flagged.
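By way of illustration only, the three-tier rule for flagging clones with extensive homology to repeats may be sketched as follows. The names FLAG_TIERS and is_extensive_repeat are hypothetical.

```python
# Each tier pairs a minimum percent homology with the match length
# (in nucleotides) that must be exceeded, as recited above.
FLAG_TIERS = [
    (75.0, 50),
    (85.0, 40),
    (90.0, 30),
]


def is_extensive_repeat(identity: float, length: int) -> bool:
    """Return True if a repeat match satisfies any of the flagging tiers."""
    return any(identity >= min_identity and length > min_length
               for min_identity, min_length in FLAG_TIERS)


print(is_extensive_repeat(76.0, 51))  # long match at moderate homology
print(is_extensive_repeat(92.0, 31))  # short match at high homology
print(is_extensive_repeat(80.0, 45))  # satisfies no tier
```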

b) Identification of Structural Features

Structural features, e.g. polyA tail and polyadenylation signal, of the sequences of full length extended cDNAs are subsequently determined as follows.

A polyA tail is defined as a homopolymeric stretch of at least 11 A with at most one alternative base within it. The polyA tail search is restricted to the last 20 nt of the sequence and limited to stretches of 11 consecutive A's because sequencing reactions are often not readable after such a polyA stretch. Stretches with 100% homology over 6 nucleotides are identified as polyA tails.

To search for a polyadenylation signal, the polyA tail is clipped from the full-length sequence. The 50 bp preceding the polyA tail are first searched for the canonic polyadenylation AAUAAA signal and, if the canonic signal is not detected, for the alternative AUUAAA signal (Sheets et al., Nuc. Acids Res. 18: 5799-5805, 1990). If neither of these consensus polyadenylation signals is found, the canonic motif is searched again allowing one mismatch to account for possible sequencing errors. More than 85% of identified polyadenylation signals of either type actually end 10 to 30 bp from the polyA tail. Alternative AUUAAA signals represent approximately 15% of the total number of identified polyadenylation signals.

To search for a polyadenylation signal, the polyA tail is clipped from the full-length sequence. The 50 bp preceding the polyA tail are searched for the canonic polyadenylation AAUAAA signal allowing one mismatch to account for possible sequencing errors and known variation in the canonical sequence of the polyadenylation signal.
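By way of illustration only, the tiered polyadenylation signal search described in this section (canonic signal, then alternative signal, then canonic signal with one mismatch) may be sketched as follows. The sketch assumes that the polyA tail has already been clipped and that the sequence is written in the DNA alphabet, so that AAUAAA and AUUAAA appear as AATAAA and ATTAAA. The function names are hypothetical, and reporting the occurrence nearest the polyA tail is an assumption, since the text does not state which occurrence is retained.

```python
from typing import Optional, Tuple

CANONICAL = "AATAAA"    # AAUAAA written in the cDNA (DNA) alphabet
ALTERNATIVE = "ATTAAA"  # AUUAAA written in the cDNA (DNA) alphabet


def find_motif(window: str, motif: str, max_mismatches: int = 0) -> Optional[int]:
    """Return the start of the 3'-most occurrence of motif in window,
    allowing up to max_mismatches substitutions, or None if absent."""
    for start in range(len(window) - len(motif), -1, -1):
        candidate = window[start:start + len(motif)]
        mismatches = sum(a != b for a, b in zip(candidate, motif))
        if mismatches <= max_mismatches:
            return start
    return None


def find_polya_signal(clipped: str, window_size: int = 50) -> Optional[Tuple[str, int]]:
    """Search the last window_size nucleotides of a tail-clipped sequence.
    Returns (label, position in clipped) or None if no signal is found."""
    window = clipped[-window_size:].upper()
    offset = len(clipped) - len(window)
    tiers = (
        ("canonical", CANONICAL, 0),
        ("alternative", ALTERNATIVE, 0),
        ("canonical_1mm", CANONICAL, 1),
    )
    for label, motif, max_mismatches in tiers:
        hit = find_motif(window, motif, max_mismatches)
        if hit is not None:
            return label, offset + hit
    return None


example = "G" * 60 + "AATAAA" + "C" * 15
print(find_polya_signal(example))
```

The single-step variant described in the preceding paragraph corresponds to using only the last tier.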

c) Identification of Functional Features

Functional features, e.g. ORFs and signal sequences, of the sequences of full length extended cDNAs were subsequently determined as follows.

The 3 upper strand frames of extended cDNAs are searched for ORFs defined as the maximum length fragments beginning with a translation initiation codon and ending with a stop codon. ORFs encoding at least 20 amino acids are preferred.
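By way of illustration only, a search of the three upper strand frames for ORFs may be sketched as follows. The sketch assumes that ATG is the only translation initiation codon considered and that TAA, TAG and TGA are the stop codons; the function name find_orfs and the tuple layout are hypothetical. Maximum length is obtained by starting each ORF at the first initiation codon encountered after the previous in-frame stop codon.

```python
from typing import List, Tuple

START_CODON = "ATG"
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_amino_acids: int = 20) -> List[Tuple[int, int, int, int]]:
    """Return (frame, start, end, n_amino_acids) for each ORF on the
    upper strand; end is exclusive and includes the stop codon."""
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        orf_start = None
        for pos in range(frame, len(seq) - 2, 3):
            codon = seq[pos:pos + 3]
            if orf_start is None:
                # Outside an ORF: open one at the first initiation codon.
                if codon == START_CODON:
                    orf_start = pos
            elif codon in STOP_CODONS:
                # Inside an ORF: close it at the first in-frame stop codon.
                n_amino_acids = (pos - orf_start) // 3
                if n_amino_acids >= min_amino_acids:
                    orfs.append((frame, orf_start, pos + 3, n_amino_acids))
                orf_start = None
    return orfs


example = "CC" + "ATG" + "GCT" * 25 + "TAA" + "GG"
print(find_orfs(example))
```

ORFs left open at the end of the sequence (no in-frame stop codon) are not reported by this sketch.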

Each found ORF is then scanned for the presence of a signal peptide in the first 50 amino acids or, where appropriate, within shorter regions down to 20 amino acids or less in the ORF, using the matrix method of von Heijne (Nuc. Acids Res. 14: 4683-4690 (1986)), the disclosure of which is incorporated herein by reference, and the modification described in Example 22.

d) Homology to either Nucleotidic or Proteic Sequences

Sequences of full length extended cDNAs are then compared to known sequences on a nucleotidic or proteic basis.

Sequences of full length extended cDNAs are compared to the following known nucleic acid sequences: vertebrate sequences, EST sequences, patented sequences and recently identified sequences available at the time of filing the priority documents. Full length cDNA sequences are also compared to the sequences of a private database (Genset internal sequences) in order to find sequences that have already been identified by applicants. Sequences of full length extended cDNAs with more than 90% homology over 30 nucleotides using either BLASTN or BLAST2N as indicated in Table III are identified as sequences that have already been described. Matching vertebrate sequences are subsequently examined using FASTA; full length extended cDNAs with more than 70% homology over 30 nucleotides are identified as sequences that have already been described.

ORFs encoded by full length extended cDNAs as defined in section c) are subsequently compared to known amino acid sequences found in public databases using Swissprot, PIR and Genpept releases available at the time of filing the priority documents for the present application. These analyses were performed using BLASTP with the parameter W=8 and allowing a maximum of 10 matches. Sequences of full length extended cDNAs showing extensive homology to known protein sequences are recognized as already identified proteins.

In addition, the three-frame conceptual translation products of the top strand of full length extended cDNAs are compared to publicly known amino acid sequences of Swissprot using BLASTX with the parameter E=0.001. Sequences of full length extended cDNAs with more than 70% homology over 30 amino acid stretches are detected as already identified proteins.

As used herein the term “cDNA codes of SEQ ID NOs. 40-84 and 130-154” encompasses the nucleotide sequences of SEQ ID NOs. 40-84 and 130-154, fragments of SEQ ID NOs. 40-84 and 130-154, nucleotide sequences homologous to SEQ ID NOs. 40-84 and 130-154 or homologous to fragments of SEQ ID NOs. 40-84 and 130-154, and sequences complementary to all of the preceding sequences. The fragments include portions of SEQ ID NOs. 40-84 and 130-154 comprising at least 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, 150, 200, 300, 400, or 500 consecutive nucleotides of SEQ ID NOs. 40-84 and 130-154. Preferably, the fragments are novel fragments. Homologous sequences and fragments of SEQ ID NOs. 40-84 and 130-154 refer to a sequence having at least 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, or 75% homology to these sequences. Homology may be determined using any of the computer programs and parameters described herein, including BLAST2N with the default parameters or with any modified parameters. Homologous sequences also include RNA sequences in which uridines replace the thymines in the cDNA codes of SEQ ID NOs. 40-84 and 130-154. The homologous sequences may be obtained using any of the procedures described herein or may result from the correction of a sequencing error as described above. It will be appreciated that the cDNA codes of SEQ ID NOs. 40-84 and 130-154 can be represented in the traditional single character format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W.H. Freeman & Co., New York.) or in any other format which records the identity of the nucleotides in a sequence.

As used herein the term “polypeptide codes of SEQ ID NOS. 85-129 and 155-179” encompasses the polypeptide sequences of SEQ ID NOs. 85-129 and 155-179 which are encoded by the extended cDNAs of SEQ ID NOs. 40-84 and 130-154, polypeptide sequences homologous to the polypeptides of SEQ ID NOS. 85-129 and 155-179, or fragments of any of the preceding sequences. Homologous polypeptide sequences refer to a polypeptide sequence having at least 99%, 98%, 97%, 96%, 95%, 90%, 85%, 80%, or 75% homology to one of the polypeptide sequences of SEQ ID NOS. 85-129 and 155-179. Homology may be determined using any of the computer programs and parameters described herein, including FASTA with the default parameters or with any modified parameters. The homologous sequences may be obtained using any of the procedures described herein or may result from the correction of a sequencing error as described above. The polypeptide fragments comprise at least 5, 10, 15, 20, 25, 30, 35, 40, 50, 75, 100, or 150 consecutive amino acids of the polypeptides of SEQ ID NOS. 85-129 and 155-179. Preferably, the fragments are novel fragments. It will be appreciated that the polypeptide codes of the SEQ ID NOS. 85-129 and 155-179 can be represented in the traditional single character format or three letter format (See the inside back cover of Stryer, Lubert. Biochemistry, 3rd edition. W.H. Freeman & Co., New York.) or in any other format which relates the identity of the polypeptides in a sequence.

It will be appreciated by those skilled in the art that the cDNA codes of SEQ ID NOs. 40-84 and 130-154 and polypeptide codes of SEQ ID NOS. 85-129 and 155-179 can be stored, recorded, and manipulated on any medium which can be read and accessed by a computer. As used herein, the words “recorded” and “stored” refer to a process for storing information on a computer medium. A skilled artisan can readily adopt any of the presently known methods for recording information on a computer readable medium to generate manufactures comprising one or more of the cDNA codes of SEQ ID NOs. 40-84 and 130-154, or one or more of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. Another aspect of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, or 50 cDNA codes of SEQ ID NOs. 40-84 and 130-154. Another aspect of the present invention is a computer readable medium having recorded thereon at least 2, 5, 10, 15, 20, 25, 30, or 50 polypeptide codes of SEQ ID NOS. 85-129 and 155-179.

Computer readable media include magnetically readable media, optically readable media, electronically readable media and magnetic/optical media. For example, the computer readable media may be a hard disc, a floppy disc, a magnetic tape, CD-ROM, DVD, RAM, or ROM as well as other types of media known to those skilled in the art.

Embodiments of the present invention include systems, particularly computer systems which contain the sequence information described herein. As used herein, “a computer system” refers to the hardware components, software components, and data storage components used to analyze the nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154, or the amino acid sequences of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. The computer system preferably includes the computer readable media described above, and a processor for accessing and manipulating the sequence data.

Preferably, the computer is a general purpose system that comprises a central processing unit (CPU), one or more data storage components for storing data, and one or more data retrieving devices for retrieving the data stored on the data storage components. A skilled artisan can readily appreciate that any one of the currently available computer systems is suitable.

In one particular embodiment, the computer system includes a processor connected to a bus which is connected to a main memory (preferably implemented as RAM) and one or more data storage devices, such as a hard drive and/or other computer readable media having data recorded thereon. In some embodiments, the computer system further includes one or more data retrieving devices for reading the data stored on the data storage components. The data retrieving device may represent, for example, a floppy disk drive, a compact disk drive, a magnetic tape drive, etc. In some embodiments, the data storage component is a removable computer readable medium such as a floppy disk, a compact disk, a magnetic tape, etc. containing control logic and/or data recorded thereon. The computer system may advantageously include or be programmed by appropriate software for reading the control logic and/or the data from the data storage component once inserted in the data retrieving device. Software for accessing and processing the nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154, or the amino acid sequences of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 (such as search tools, compare tools, and modeling tools etc.) may reside in main memory during execution.

In some embodiments, the computer system may further comprise a sequence comparer for comparing the above-described cDNA codes of SEQ ID NOs. 40-84 and 130-154 or polypeptide codes of SEQ ID NOS. 85-129 and 155-179 stored on a computer readable medium to reference nucleotide or polypeptide sequences stored on a computer readable medium. A “sequence comparer” refers to one or more programs which are implemented on the computer system to compare a nucleotide or polypeptide sequence with other nucleotide or polypeptide sequences and/or compounds including but not limited to peptides, peptidomimetics, and chemicals stored within the data storage means. For example, the sequence comparer may compare the nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154, or the amino acid sequences of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 stored on a computer readable medium to reference sequences stored on a computer readable medium to identify homologies, motifs implicated in biological function, or structural motifs. The various sequence comparer programs identified elsewhere in this patent specification are particularly contemplated for use in this aspect of the invention.

Accordingly, one aspect of the present invention is a computer system comprising a processor, a data storage device having stored thereon a cDNA code of SEQ ID NOs. 40-84 and 130-154 or a polypeptide code of SEQ ID NOS. 85-129 and 155-179, a data storage device having retrievably stored thereon reference nucleotide sequences or polypeptide sequences to be compared to the cDNA code of SEQ ID NOs. 40-84 and 130-154 or polypeptide code of SEQ ID NOS. 85-129 and 155-179 and a sequence comparer for conducting the comparison. The sequence comparer may indicate a homology level between the sequences compared or identify structural motifs in the above described cDNA code of SEQ ID NOs. 40-84 and 130-154 and polypeptide codes of SEQ ID NOS. 85-129 and 155-179 or it may identify structural motifs in sequences which are compared to these cDNA codes and polypeptide codes. In some embodiments, the data storage device may have stored thereon the sequences of at least 2, 5, 10, 15, 20, 25, 30, or 50 of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or polypeptide codes of SEQ ID NOS. 85-129 and 155-179.

Another aspect of the present invention is a method for determining the level of homology between a cDNA code of SEQ ID NOs. 40-84 and 130-154 and a reference nucleotide sequence, comprising the steps of reading the cDNA code and the reference nucleotide sequence through the use of a computer program which determines homology levels and determining homology between the cDNA code and the reference nucleotide sequence with the computer program. The computer program may be any of a number of computer programs for determining homology levels, including those specifically enumerated below, including BLAST2N with the default parameters or with any modified parameters. The method may be implemented using the computer systems described above. The method may also be performed by reading 2, 5, 10, 15, 20, 25, 30, or 50 of the above described cDNA codes of SEQ ID NOs. 40-84 and 130-154 through use of the computer program and determining homology between the cDNA codes and reference nucleotide sequences.

Alternatively, the computer program may be a computer program which compares the nucleotide sequences of the cDNA codes of the present invention to reference nucleotide sequences in order to determine whether the cDNA code of SEQ ID NOs. 40-84 and 130-154 differs from a reference nucleic acid sequence at one or more positions. Optionally such a program records the length and identity of inserted, deleted or substituted nucleotides with respect to the sequence of either the reference polynucleotide or the cDNA code of SEQ ID NOs. 40-84 and 130-154. In one embodiment, the computer program may be a program which determines whether the nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 contain a single nucleotide polymorphism (SNP) with respect to a reference nucleotide sequence. This single nucleotide polymorphism may comprise a single base substitution, insertion, or deletion.
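By way of illustration only, a program which records the position and identity of substituted, inserted or deleted nucleotides may be sketched with the SequenceMatcher class of the Python standard library module difflib. This is a general purpose string comparison and not one of the alignment programs named in this specification; on repetitive sequences it may report a different but equivalent set of edits than a biological aligner would. The function name list_differences and the tuple layout are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

KIND = {"replace": "substitution", "delete": "deletion", "insert": "insertion"}


def list_differences(cdna: str, reference: str) -> List[Tuple[str, int, str, str]]:
    """Return (kind, reference position, reference bases, cDNA bases)
    for every region where cdna differs from reference."""
    matcher = SequenceMatcher(None, reference, cdna, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        differences.append((KIND[tag], i1, reference[i1:i2], cdna[j1:j2]))
    return differences


reference = "AACCGGTT"
print(list_differences("AACCTGTT", reference))   # single base substitution
print(list_differences("AACCAGGTT", reference))  # single base insertion
print(list_differences("AACGGTT", reference))    # single base deletion
```

A difference whose reference and cDNA segments are each at most one base long corresponds to the single base substitution, insertion, or deletion described above.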

Another aspect of the present invention is a method for determining the level of homology between a polypeptide code of SEQ ID NOS. 85-129 and 155-179 and a reference polypeptide sequence, comprising the steps of reading the polypeptide code of SEQ ID NOS. 85-129 and 155-179 and the reference polypeptide sequence through use of a computer program which determines homology levels and determining homology between the polypeptide code and the reference polypeptide sequence using the computer program.

Accordingly, another aspect of the present invention is a method for determining whether a cDNA code of SEQ ID NOs. 40-84 and 130-154 differs at one or more nucleotides from a reference nucleotide sequence comprising the steps of reading the cDNA code and the reference nucleotide sequence through use of a computer program which identifies differences between nucleic acid sequences and identifying differences between the cDNA code and the reference nucleotide sequence with the computer program. In some embodiments, the computer program is a program which identifies single nucleotide polymorphisms. The method may be implemented by the computer systems described above. The method may also be performed by reading at least 2, 5, 10, 15, 20, 25, 30, or 50 of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 and the reference nucleotide sequences through the use of the computer program and identifying differences between the cDNA codes and the reference nucleotide sequences with the computer program.

In other embodiments the computer based system may further comprise an identifier for identifying features within the nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the amino acid sequences of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179.

An “identifier” refers to one or more programs which identify certain features within the above-described nucleotide sequences of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the amino acid sequences of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. In one embodiment, the identifier may comprise a program which identifies an open reading frame in the cDNA codes of SEQ ID NOs. 40-84 and 130-154.
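As a minimal sketch of an identifier that locates open reading frames, the following scans the three forward reading frames for an ATG followed in frame by a stop codon. The function name `find_orfs`, the restriction to the forward strand, and the `min_codons` parameter are assumptions of this sketch, not features recited in the disclosure.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=1):
    """Return (start, end, frame) for each ATG...stop ORF on the forward strand.

    start is 0-based, end is exclusive and includes the stop codon.
    min_codons counts the codons before the stop codon (ATG included).
    """
    seq = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # first in-frame initiation codon
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                   # look for the next ORF in this frame
    return orfs


if __name__ == "__main__":
    print(find_orfs("CCATGAAATAGCC"))
```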

In another embodiment, the identifier may comprise a molecular modeling program which determines the 3-dimensional structure of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. In some embodiments, the molecular modeling program identifies target sequences that are most compatible with profiles representing the structural environments of the residues in known three-dimensional protein structures. (See, e.g., Eisenberg et al., U.S. Pat. No. 5,436,850 issued Jul. 25, 1995). In another technique, the known three-dimensional structures of proteins in a given family are superimposed to define the structurally conserved regions in that family. This protein modeling technique also uses the known three-dimensional structure of a homologous protein to approximate the structure of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. (See e.g., Srinivasan, et al., U.S. Pat. No. 5,557,535 issued Sep. 17, 1996). Conventional homology modeling techniques have been used routinely to build models of proteases and antibodies. (Sowdhamini et al., Protein Engineering 10:207, 215 (1997)). Comparative approaches can also be used to develop three-dimensional protein models when the protein of interest has poor sequence identity to template proteins. In some cases, proteins fold into similar three-dimensional structures despite having very weak sequence identities. For example, a number of helical cytokines fold into similar three-dimensional topologies in spite of weak sequence homology.

The recent development of threading methods now enables the identification of likely folding patterns in a number of situations where the structural relatedness between target and template(s) is not detectable at the sequence level. Hybrid methods may also be used, in which fold recognition is performed using Multiple Sequence Threading (MST), structural equivalencies are deduced from the threading output using the distance geometry program DRAGON to construct a low resolution model, and a full-atom representation is constructed using a molecular modeling package such as QUANTA.

According to this 3-step approach, candidate templates are first identified by using the novel fold recognition algorithm MST, which is capable of performing simultaneous threading of multiple aligned sequences onto one or more 3-D structures. In a second step, the structural equivalencies obtained from the MST output are converted into interresidue distance restraints and fed into the distance geometry program DRAGON, together with auxiliary information obtained from secondary structure predictions. The program combines the restraints in an unbiased manner and rapidly generates a large number of low resolution model conformations. In a third step, these low resolution model conformations are converted into full-atom models and subjected to energy minimization using the molecular modeling package QUANTA. (See e.g., Aszódi et al., Proteins: Structure, Function, and Genetics, Supplement 1:38-42 (1997)).

The results of the molecular modeling analysis may then be used in rational drug design techniques to identify agents which modulate the activity of the polypeptide codes of SEQ ID NOS. 85-129 and 155-179.

Accordingly, another aspect of the present invention is a method of identifying a feature within the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 comprising reading the cDNA code(s) or the polypeptide code(s) through the use of a computer program which identifies features therein and identifying features within the cDNA code(s) or polypeptide code(s) with the computer program. In one embodiment, the computer program comprises a computer program which identifies open reading frames. In a further embodiment, the computer program identifies structural motifs in a polypeptide sequence. In another embodiment, the computer program comprises a molecular modeling program. The method may be performed by reading a single sequence or at least 2, 5, 10, 15, 20, 25, 30, or 50 of the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 through the use of the computer program and identifying features within the cDNA codes or polypeptide codes with the computer program.

The cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 may be stored and manipulated in a variety of data processor programs in a variety of formats. For example, the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179 may be stored as text in a word processing file, such as Microsoft WORD or WORDPERFECT, or as an ASCII file in a variety of database programs familiar to those of skill in the art, such as DB2, SYBASE, or ORACLE. In addition, many computer programs and databases may be used as sequence comparers, identifiers, or sources of reference nucleotide or polypeptide sequences to be compared to the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. The following list is intended not to limit the invention but to provide guidance to programs and databases which are useful with the cDNA codes of SEQ ID NOs. 40-84 and 130-154 or the polypeptide codes of SEQ ID NOS. 85-129 and 155-179. The programs and databases which may be used include, but are not limited to: MACPATTERN (EMBL), DISCOVERYBASE (Molecular Applications Group), GENEMINE (Molecular Applications Group), LOOK (Molecular Applications Group), MACLOOK (Molecular Applications Group), BLAST and BLAST2 (NCBI), BLASTN and BLASTX (Altschul et al., J. Mol. Biol. 215: 403 (1990)), FASTA (Pearson and Lipman, Proc. Natl. Acad. Sci. USA, 85: 2444 (1988)), FASTDB (Brutlag et al. Comp. App. Biosci. 6:237-245, 1990), CATALYST (Molecular Simulations Inc.), CATALYST/SHAPE (Molecular Simulations Inc.), CERIUS².DBACCESS (Molecular Simulations Inc.), HYPOGEN (Molecular Simulations Inc.), INSIGHT II (Molecular Simulations Inc.), DISCOVER (Molecular Simulations Inc.), CHARMM (Molecular Simulations Inc.), FELIX (Molecular Simulations Inc.), DELPHI (Molecular Simulations Inc.), QUANTEMM (Molecular Simulations Inc.), HOMOLOGY (Molecular Simulations Inc.), MODELER (Molecular Simulations Inc.), ISIS (Molecular Simulations Inc.), QUANTA/PROTEIN DESIGN (Molecular Simulations Inc.), WEBLAB (Molecular Simulations Inc.), WEBLAB DIVERSITY EXPLORER (Molecular Simulations Inc.), GENE EXPLORER (Molecular Simulations Inc.), SEQFOLD (Molecular Simulations Inc.), the EMBL/Swissprotein database, the MDL Available Chemicals Directory database, the MDL Drug Data Report database, the Comprehensive Medicinal Chemistry database, DERWENT'S World Drug Index database, the BIOBYTEMASTERFILE database, the GENBANK database, and the GENSEQN database. Many other programs and databases would be apparent to one of skill in the art given the present disclosure.

Motifs which may be detected using the above programs include sequences encoding leucine zippers, helix-turn-helix motifs, glycosylation sites, ubiquitination sites, alpha helices and beta sheets, signal sequences encoding signal peptides which direct the secretion of the encoded proteins, sequences implicated in transcription regulation such as homeoboxes, acidic stretches, enzymatic active sites, substrate binding sites, and enzymatic cleavage sites.
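For illustration, detection of one such motif, a potential N-glycosylation site, can be sketched with a regular expression. The consensus used here, N-{P}-[ST]-{P} (asparagine, any residue except proline, serine or threonine, any residue except proline), is the commonly cited PROSITE-style pattern for this site; the function name `find_n_glycosylation_sites` is hypothetical and the sketch is not the implementation of any program listed above.

```python
import re

# Lookahead form so that overlapping candidate sites are all reported.
N_GLYCOSYLATION = re.compile(r"(?=N[^P][ST][^P])")


def find_n_glycosylation_sites(protein):
    """Return 1-based positions of the asparagine in each candidate site."""
    return [match.start() + 1 for match in N_GLYCOSYLATION.finditer(protein.upper())]


if __name__ == "__main__":
    # N at position 3 is followed by G, S, A and is reported;
    # N at position 8 is followed by P and is not.
    print(find_n_glycosylation_sites("MKNGSAANPSA"))
```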

5. Selection of Cloned Full Length Sequences of the Present Invention

Cloned full length extended cDNA sequences that have already been characterized by the aforementioned computer analysis are then submitted to an automatic procedure in order to preselect full length extended cDNAs containing sequences of interest.

a) Automatic Sequence Preselection

All complete cloned full length extended cDNAs clipped for vector on both ends are considered. First, a negative selection is carried out in order to eliminate unwanted cloned sequences resulting from either contaminants or PCR artifacts as follows. Sequences matching contaminant sequences such as vector RNA, tRNA, mtRNA, rRNA sequences are discarded, as well as those encoding ORF sequences exhibiting extensive homology to repeats as defined in section 4a). Sequences obtained by direct cloning using nested primers on 5′ and 3′ tags (section 1. case a) but lacking a polyA tail are discarded. Only ORFs containing a signal peptide and ending either before the polyA tail (case a) or before the end of the cloned 3′UTR (case b) are kept. Then, ORFs containing unlikely mature proteins, such as mature proteins whose size is less than 20 amino acids or less than 25% of the immature protein size, are eliminated.
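The last of these filters, the size test on the mature protein, can be expressed as a short sketch. The function name `keep_orf` and its parameter names are hypothetical; the sketch assumes that the "immature protein size" is the length of the full precursor including the signal peptide.

```python
def keep_orf(signal_peptide_length, full_protein_length,
             min_mature_length=20, min_mature_fraction=0.25):
    """Return False for ORFs whose predicted mature protein is unlikely.

    Lengths are in amino acids. The mature protein is taken to be the
    precursor minus its signal peptide (an assumption of this sketch).
    """
    mature_length = full_protein_length - signal_peptide_length
    if mature_length < min_mature_length:
        return False    # mature protein shorter than 20 amino acids
    if mature_length < min_mature_fraction * full_protein_length:
        return False    # mature protein under 25% of the immature protein
    return True


if __name__ == "__main__":
    print(keep_orf(20, 100), keep_orf(20, 35), keep_orf(90, 115))
```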

In the selection of the ORF, priority was given to the ORF and the frame corresponding to the polypeptides described in SignalTag Patents (U.S. patent application Ser. Nos. 08/905,223; 08/905,135; 08/905,051; 08/905,144; 08/905,279; 08/904,468; 08/905,134; and 08/905,133). If the ORF was not found among the ORFs described in the SignalTag Patents, the ORF encoding the signal peptide with the highest score according to the Von Heijne method as defined in Example 22 was chosen. If the scores were identical, then the longest ORF was chosen.
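The priority rule of the preceding paragraph can be sketched as follows. The record layout (a dictionary with the keys `name`, `in_signaltag`, `score` and `length`) and the function name `choose_orf` are hypothetical; where several candidate ORFs are already described in the SignalTag Patents, this sketch simply returns the first, which is an assumption not addressed by the text.

```python
def choose_orf(orfs):
    """Pick one ORF from a non-empty list of candidate records.

    Each record is a dict with:
      'name'         - label for the ORF
      'in_signaltag' - True if the ORF was already described (priority)
      'score'        - von Heijne score of the encoded signal peptide
      'length'       - ORF length, used only to break score ties
    """
    described = [orf for orf in orfs if orf["in_signaltag"]]
    if described:
        return described[0]
    # Highest score first; identical scores are resolved by the longest ORF.
    return max(orfs, key=lambda orf: (orf["score"], orf["length"]))


if __name__ == "__main__":
    candidates = [
        {"name": "orf1", "in_signaltag": False, "score": 5.5, "length": 300},
        {"name": "orf2", "in_signaltag": False, "score": 8.2, "length": 150},
        {"name": "orf3", "in_signaltag": False, "score": 8.2, "length": 450},
    ]
    print(choose_orf(candidates)["name"])
```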

Sequences of full length extended cDNA clones are then compared pairwise with BLAST after masking of the repeat sequences. Sequences containing at least 90% homology over 30 nucleotides are clustered in the same class. Each cluster is then subjected to a cluster analysis that detects sequences resulting from internal priming or from alternative splicing, identical sequences or sequences with several frameshifts. This automatic analysis serves as a basis for manual selection of the sequences.
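One way to group clones into classes from pairwise comparison results is single-linkage clustering with a union-find structure, sketched below. This is an illustrative technique chosen for the sketch, not necessarily the clustering used by the inventors; the input format (tuples of clone identifiers, percent identity and alignment length) and the function name `cluster_clones` are assumptions.

```python
def cluster_clones(clone_ids, hits, min_identity=90.0, min_length=30):
    """Group clones linked by hits of >= min_identity over >= min_length nt.

    hits is an iterable of (clone_a, clone_b, percent_identity, aligned_length),
    for example as parsed from pairwise comparison output.
    """
    parent = {clone: clone for clone in clone_ids}

    def find(clone):
        # Follow parent links to the class representative, halving the path.
        while parent[clone] != clone:
            parent[clone] = parent[parent[clone]]
            clone = parent[clone]
        return clone

    for clone_a, clone_b, identity, length in hits:
        if identity >= min_identity and length >= min_length:
            parent[find(clone_a)] = find(clone_b)   # merge the two classes

    classes = {}
    for clone in clone_ids:
        classes.setdefault(find(clone), []).append(clone)
    return sorted(sorted(members) for members in classes.values())


if __name__ == "__main__":
    hits = [("A", "B", 95.0, 40), ("B", "C", 92.0, 30), ("C", "D", 99.0, 20)]
    print(cluster_clones(["A", "B", "C", "D"], hits))
```

In this example clone D stays in its own class because its only match, although 99% identical, spans fewer than 30 nucleotides.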

b) Manual Sequence Selection

Manual selection can be carried out using automatically generated reports for each sequenced full length extended cDNA clone. During this manual procedure, a selection is made between clones belonging to the same class as follows. ORF sequences encoded by clones belonging to the same class are aligned and compared. If the homology between nucleotide sequences of clones belonging to the same class is more than 90% over 30 nucleotide stretches or if the homology between amino acid sequences of clones belonging to the same class is more than 80% over 20 amino acid stretches, then the clones are considered as being identical. The chosen ORF is the best one according to the criteria mentioned below. If the nucleotide and amino acid homologies are less than 90% and 80% respectively, the clones are said to encode distinct proteins, both of which can be selected if they contain sequences of interest.

Selection of full length extended cDNA clones encoding sequences of interest is performed using the following criteria. Structural parameters (initial tag, polyadenylation site and signal) are first checked. Then, homologies with known nucleic acids and proteins are examined in order to determine whether the clone sequence matches a known nucleic acid or protein sequence and, in the latter case, its covering rate and the date at which the sequence became public. If there is no extensive match with sequences other than ESTs or genomic DNA, or if the clone sequence brings substantial new information, such as encoding a protein resulting from alternative splicing of an mRNA coding for an already known protein, the sequence is kept. Examples of such cloned full length extended cDNAs containing sequences of interest are described in Example 28. Sequences resulting from chimera or double inserts as assessed by homology to other sequences are discarded during this procedure.

EXAMPLE 28 Cloning and Sequencing of Extended cDNAs

The procedure described in Example 27 above was used to obtain the extended cDNAs of the present invention. Using this approach, the full length cDNA of SEQ ID NO:17 was obtained. This cDNA falls into the “EST-ext” category described above and encodes the signal peptide MKKVLLLITAILAVAVG (SEQ ID NO:18) having a von Heijne score of 8.2.

The full length cDNA of SEQ ID NO:19 was also obtained using this procedure. This cDNA falls into the “EST-ext” category described above and encodes the signal peptide MWWFQQGLSFLPSALVIWTSA (SEQ ID NO:20) having a von Heijne score of 5.5.

Another full length cDNA obtained using the procedure described above has the sequence of SEQ ID NO:21. This cDNA falls into the “EST-ext” category described above and encodes the signal peptide MVLTTLPSANSANSPVNMPTTGPNSLSYASSALSPCLT (SEQ ID NO:22) having a von Heijne score of 5.9.

The above procedure was also used to obtain a full length cDNA having the sequence of SEQ ID NO:23. This cDNA falls into the “EST-ext” category described above and encodes the signal peptide ILSTVTALTFAXA (SEQ ID NO:24) having a von Heijne score of 5.5.

The full length cDNA of SEQ ID NO:25 was also obtained using this procedure. This cDNA falls into the “new” category described above and encodes a signal peptide LVLTLCTLPLAVA (SEQ ID NO:26) having a von Heijne score of 10.1.

The full length cDNA of SEQ ID NO:27 was also obtained using this procedure. This cDNA falls into the “new” category described above and encodes a signal peptide LWLLFFLVTAIHA (SEQ ID NO:28) having a von Heijne score of 10.7.

The above procedures were also used to obtain the extended cDNAs of the present invention. 5′ ESTs expressed in a variety of tissues were obtained as described above. The appended sequence listing provides the tissues from which the extended cDNAs were obtained. It will be appreciated that the extended cDNAs may also be expressed in tissues other than the tissue listed in the sequence listing.

5′ ESTs obtained as described above were used to obtain extended cDNAs having the sequences of SEQ ID NOs: 40-84 and 130-154. Table IV provides the sequence identification numbers of the extended cDNAs of the present invention, the locations of the full coding sequences in SEQ ID NOs: 40-84 and 130-154 (i.e. the nucleotides encoding both the signal peptide and the mature protein, listed under the heading FCS location in Table IV), the locations of the nucleotides in SEQ ID NOs: 40-84 and 130-154 which encode the signal peptides (listed under the heading SigPep Location in Table IV), the locations of the nucleotides in SEQ ID NOs: 40-84 and 130-154 which encode the mature proteins generated by cleavage of the signal peptides (listed under the heading Mature Polypeptide Location in Table IV), the locations in SEQ ID NOs: 40-84 and 130-154 of stop codons (listed under the heading Stop Codon Location in Table IV), the locations in SEQ ID NOs: 40-84 and 130-154 of polyA signals (listed under the heading Poly A Signal Location in Table IV) and the locations of polyA sites (listed under the heading Poly A Site Location in Table IV).

The polypeptides encoded by the extended cDNAs were screened for the presence of known structural or functional motifs or for the presence of signatures, small amino acid sequences which are well conserved amongst the members of a protein family. The conserved regions have been used to derive consensus patterns or matrices included in the PROSITE data bank, in particular in the file prosite.dat (Release 13.0 of November 1995, located at http://expasy.hcuge.ch/sprot/prosite.html). The prosite_convert and prosite_scan programs (http://ulrec3.unil.ch/ftpserveur/prosite_scan) were used to find signatures on the extended cDNAs.

For each pattern obtained with the prosite_convert program from the prosite.dat file, the accuracy of the detection on a new protein sequence has been tested by evaluating the frequency of irrelevant hits on the population of human secreted proteins included in the data bank SWISSPROT. The ratio between the number of hits on shuffled proteins (with a window size of 20 amino acids) and the number of hits on native (unshuffled) proteins was used as an index. Every pattern for which the ratio was greater than 20% (one hit on shuffled proteins for 5 hits on native proteins) was skipped during the search with prosite_scan. The program used to shuffle protein sequences (db_shuffled) and the program used to determine the statistics for each pattern in the protein data banks (prosite_statistics) are available on the ftp site http://ulrec3.unil.ch/ftpserveur/prosite_scan.
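The index described above can be illustrated with a short sketch. It is not the db_shuffled or prosite_statistics program; the function names `shuffle_in_windows` and `keep_pattern` are hypothetical, and the reading of "window size of 20 amino acids" as shuffling residues within consecutive 20-residue blocks is an assumption of this sketch.

```python
import random


def shuffle_in_windows(protein, window=20, seed=0):
    """Shuffle residues within consecutive blocks of `window` amino acids.

    Local composition is preserved while the residue order is destroyed
    (an assumed interpretation of the window-based shuffling).
    """
    rng = random.Random(seed)
    shuffled_blocks = []
    for start in range(0, len(protein), window):
        block = list(protein[start:start + window])
        rng.shuffle(block)
        shuffled_blocks.append("".join(block))
    return "".join(shuffled_blocks)


def keep_pattern(hits_on_shuffled, hits_on_native, max_ratio=0.20):
    """Keep a pattern unless its shuffled/native hit ratio exceeds 20%."""
    if hits_on_native == 0:
        return False    # no native hits: the index is undefined, so skip
    return hits_on_shuffled / hits_on_native <= max_ratio


if __name__ == "__main__":
    print(keep_pattern(1, 5), keep_pattern(2, 5))
```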

Table V lists the sequence identification numbers of the polypeptides of SEQ ID NOs: 85-129 and 155-179, the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the full length polypeptide (second column), the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the signal peptides (third column), and the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the mature polypeptide created by cleaving the signal peptide from the full length polypeptide (fourth column).

The nucleotide sequences of the sequences of SEQ ID NOs: 40-84 and 130-154 and the amino acid sequences encoded by SEQ ID NOs: 40-84 and 130-154 (i.e. amino acid sequences of SEQ ID NOs: 85-129 and 155-179) are provided in the appended sequence listing. In some instances, the sequences are preliminary and may include some incorrect or ambiguous sequences or amino acids. The sequences of SEQ ID NOs: 40-84 and 130-154 can readily be screened for any errors therein and any sequence ambiguities can be resolved by resequencing a fragment containing such errors or ambiguities on both strands. Sequences containing such errors will generally be at least 95%, at least 96%, at least 97%, at least 98%, or at least 99% homologous to the sequences of SEQ ID Nos. 85-129 and 155-179 and such sequences are included in the nucleic acids and polypeptides of the present invention. Nucleic acid fragments for resolving sequencing errors or ambiguities may be obtained from the deposited clones or can be isolated using the techniques described herein. Resolution of any such ambiguities or errors may be facilitated by using primers which hybridize to sequences located close to the ambiguous or erroneous sequences. For example, the primers may hybridize to sequences within 50-75 bases of the ambiguity or error. Upon resolution of an error or ambiguity, the corresponding corrections can be made in the protein sequences encoded by the DNA containing the error or ambiguity. The amino acid sequence of the protein encoded by a particular clone can also be determined by expression of the clone in a suitable host cell, collecting the protein, and determining its sequence.

For each amino acid sequence, Applicants have identified what they have determined to be the reading frame best identifiable with sequence information available at the time of filing. Some of the amino acid sequences may contain “Xaa” designators. These “Xaa” designators indicate either (1) a residue which cannot be identified because of nucleotide sequence ambiguity or (2) a stop codon in the determined sequence where Applicants believe one should not exist (if the sequence were determined more accurately).

Cells containing the extended cDNAs (SEQ ID NOs: 40-84 and 130-154) of the present invention in the vector pED6dpc2 are maintained in permanent deposit by the inventors at Genset, S. A., 24 Rue Royale, 75008 Paris, France.

Pools of cells containing the extended cDNAs (SEQ ID NOs: 40-84), from which cells containing a particular polynucleotide are obtainable, were deposited with the American Type Culture Collection (ATCC), 10801 University Blvd., Manassas, Va., U.S.A., 20110-2209. Each extended cDNA clone has been transfected into separate bacterial cells (E. coli) for this composite deposit. Table VI lists the deposit numbers of the clones of SEQ ID Nos: 40-84. A pool of cells designated SignalTag 28011999, which contains the clones of SEQ ID NOs 71-84, was mailed to the European Collection of Cell Cultures (ECACC), Vaccine Research and Production Laboratory, Public Health Laboratory Service, Centre for Applied Microbiology and Research, Porton Down, Salisbury, Wiltshire SP4 0JG, United Kingdom on Jan. 28, 1999 and was received on Jan. 29, 1999. This pool of cells has the ECACC Accession # XXXXXX. One or more pools of cells containing the extended cDNAs of SEQ ID Nos: 130-154, from which the cells containing a particular polynucleotide are obtainable, will be deposited with the European Collection of Cell Cultures, Vaccine Research and Production Laboratory, Public Health Laboratory Service, Centre for Applied Microbiology and Research, Porton Down, Salisbury, Wiltshire SP4 0JG, United Kingdom and will be assigned ECACC deposit number XXXXXXX. Table VII provides the internal designation number assigned to each SEQ ID NO. and indicates whether the sequence is a nucleic acid sequence or a protein sequence.

Each extended cDNA can be removed from the pED6dpc2 vector in which it was deposited by performing a NotI, PstI double digestion to produce the appropriate fragment for each clone. The proteins encoded by the extended cDNAs may also be expressed from the promoter in pED6dpc2.

Bacterial cells containing a particular clone can be obtained from the composite deposit as follows:

An oligonucleotide probe or probes should be designed to the sequence that is known for that particular clone. This sequence can be derived from the sequences provided herein, or from a combination of those sequences. The design of the oligonucleotide probe should preferably follow these parameters:

(a) It should be designed to an area of the sequence which has the fewest ambiguous bases (“N's”), if any;
(b) Preferably, the probe is designed to have a Tm of approx. 80° C. (assuming 2 degrees for each A or T and 4 degrees for each G or C). However, probes having melting temperatures between 40° C. and 80° C. may also be used provided that specificity is not lost.
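The melting temperature estimate used in parameter (b), 2 degrees for each A or T and 4 degrees for each G or C, can be computed as in the following sketch. The function names `estimate_tm` and `count_ambiguous` are hypothetical; bases other than A, C, G and T contribute nothing to the estimate in this sketch.

```python
def estimate_tm(probe):
    """Estimate Tm in degrees C as 2 per A or T plus 4 per G or C."""
    seq = probe.upper()
    at_count = seq.count("A") + seq.count("T")
    gc_count = seq.count("G") + seq.count("C")
    return 2 * at_count + 4 * gc_count


def count_ambiguous(probe):
    """Count ambiguous bases ('N') in a candidate probe region."""
    return probe.upper().count("N")


if __name__ == "__main__":
    # Primer of SEQ ID NO:38 (9 A/T and 10 G/C bases), used only as an example input.
    print(estimate_tm("GGCCATACACTTGAGTGAC"))
```

Under this rule, a probe needs roughly 20 G or C bases, or proportionally more bases when A and T are present, to approach the preferred 80° C.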

The oligonucleotide should preferably be labeled with γ-[³²P]ATP (specific activity 6000 Ci/mmole) and T4 polynucleotide kinase using commonly employed techniques for labeling oligonucleotides. Other labeling techniques can also be used. Unincorporated label should preferably be removed by gel filtration chromatography or other established methods. The amount of radioactivity incorporated into the probe should be quantified by measurement in a scintillation counter. Preferably, specific activity of the resulting probe should be approximately 4×10⁶ dpm/pmole.

The bacterial culture containing the pool of full-length clones should preferably be thawed and 100 μl of the stock used to inoculate a sterile culture flask containing 25 ml of sterile L-broth containing ampicillin at 100 μg/ml. The culture should preferably be grown to saturation at 37° C., and the saturated culture should preferably be diluted in fresh L-broth. Aliquots of these dilutions should preferably be plated to determine the dilution and volume which will yield approximately 5000 distinct and well-separated colonies on solid bacteriological media containing L-broth containing ampicillin at 100 μg/ml and agar at 1.5% in a 150 mm petri dish when grown overnight at 37° C. Other known methods of obtaining distinct, well-separated colonies can also be employed.

Standard colony hybridization procedures should then be used to transfer the colonies to nitrocellulose filters and lyse, denature and bake them.

The filter is then preferably incubated at 65° C. for 1 hour with gentle agitation in 6×SSC (20× stock is 175.3 g NaCl/liter, 88.2 g Na citrate/liter, adjusted to pH 7.0 with NaOH) containing 0.5% SDS, 100 μg/ml of yeast RNA, and 10 mM EDTA (approximately 10 mL per 150 mm filter). Preferably, the probe is then added to the hybridization mix at a concentration greater than or equal to 1×10⁶ dpm/mL. The filter is then preferably incubated at 65° C. with gentle agitation overnight. The filter is then preferably washed in 500 mL of 2×SSC/0.1% SDS at room temperature with gentle shaking for 15 minutes. A third wash with 0.1×SSC/0.5% SDS at 65° C. for 30 minutes to 1 hour is optional. The filter is then preferably dried and subjected to autoradiography for sufficient time to visualize the positives on the X-ray film. Other known hybridization methods can also be employed.

The positive colonies are picked and grown in culture, and plasmid DNA is isolated using standard procedures. The clones can then be verified by restriction analysis, hybridization analysis, or DNA sequencing.

The plasmid DNA obtained using these procedures may then be manipulated using standard cloning techniques familiar to those skilled in the art. Alternatively, a PCR can be done with primers designed at both ends of the extended cDNA insertion. For example, a PCR reaction may be conducted using a primer having the sequence GGCCATACACTTGAGTGAC (SEQ ID NO:38) and a primer having the sequence ATATAGACAAACGCACACC (SEQ ID NO:39). The PCR product which corresponds to the extended cDNA can then be manipulated using standard cloning techniques familiar to those skilled in the art.

In addition to PCR based methods for obtaining extended cDNAs, traditional hybridization based methods may also be employed. These methods may also be used to obtain the genomic DNAs which encode the mRNAs from which the 5′ ESTs were derived, mRNAs corresponding to the extended cDNAs, or nucleic acids which are homologous to extended cDNAs or 5′ ESTs. Example 29 below provides an example of such methods.

EXAMPLE 29 Methods for Obtaining Extended cDNAs or Nucleic Acids Homologous to Extended cDNAs or 5′ ESTs

5′ ESTs or extended cDNAs of the present invention may also be used to isolate extended cDNAs or nucleic acids homologous to extended cDNAs from a cDNA library or a genomic DNA library. Such cDNA library or genomic DNA library may be obtained from a commercial source or made using other techniques familiar to those skilled in the art. One example of such cDNA library construction is as follows.

PolyA+ RNAs are prepared and their quality checked as described in Example 13. Then, polyA+ RNAs are ligated to an oligonucleotide tag using either the chemical or enzymatic methods described in above sections 1 and 2. In both cases, the oligonucleotide tag may contain a restriction site such as Eco RI to facilitate further subcloning procedures. Northern blotting is then performed to check the size of ligated mRNAs and to ensure that the mRNAs were actually tagged.

As described in Example 14, first strand synthesis is subsequently carried out for mRNAs joined to the oligonucleotide tag, replacing the random nonamers by an oligodT primer. For instance, this oligodT primer may contain an internal tag of 4 nucleotides which is different from one tissue to the other. Alternatively, the oligonucleotide of SEQ ID NO:14 may be used. Following second strand synthesis using a primer contained in the oligonucleotide tag attached to the 5′ end of mRNA, the blunt ends of the obtained double stranded full length DNAs are modified into cohesive ends to allow subcloning into the Eco RI and Hind III sites of a Bluescript vector using the addition of a Hind III adaptor to the 3′ end of full length DNAs.

The extended full length DNAs are then separated into several fractions according to their sizes using techniques familiar to those skilled in the art. For example, electrophoretic separation may be applied in order to yield 3 or 6 different fractions. Following gel extraction and purification, the DNA fractions are subcloned into Bluescript vectors, transformed into competent bacteria and propagated under appropriate antibiotic conditions.

Such full length cDNA libraries may then be sequenced as follows or used in screening procedures to obtain nucleic acids homologous to extended cDNAs or 5′ ESTs as described below.

The 5′ end of extended cDNA isolated from the full length cDNA libraries or of nucleic acid homologous thereto may then be sequenced as described in Example 27. In a first step, the sequence corresponding to the 5′ end of the mRNA is obtained. If this sequence either corresponds to a SignalTag™ 5′ EST or fulfills the criteria to be one, the cloned insert is subcloned into an appropriate vector such as pED6dpc2, double-sequenced and submitted to the analysis and selection procedures described in Example 27.

Such cDNA or genomic DNA libraries may be used to isolate extended cDNAs obtained from 5′ EST or nucleic acids homologous to extended cDNAs or 5′ EST as follows. The cDNA library or genomic DNA library is hybridized to a detectable probe comprising at least 10 consecutive nucleotides from the 5′ EST or extended cDNA using conventional techniques. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the 5′ EST or extended cDNA. More preferably, the probe comprises at least 20 to 30 consecutive nucleotides from the 5′ EST or extended cDNA. In some embodiments, the probe comprises at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the 5′ EST or extended cDNA.

Techniques for identifying cDNA clones in a cDNA library which hybridize to a given probe sequence are disclosed in Sambrook et al., Molecular Cloning: A Laboratory Manual 2d Ed., Cold Spring Harbor Laboratory Press, 1989, the disclosure of which is incorporated herein by reference. The same techniques may be used to isolate genomic DNAs.

Briefly, cDNA or genomic DNA clones which hybridize to the detectable probe are identified and isolated for further manipulation as follows. A probe comprising at least 10 consecutive nucleotides from the 5′ EST or extended cDNA is labeled with a detectable label such as a radioisotope or a fluorescent molecule. Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the 5′ EST or extended cDNA. More preferably, the probe comprises 20 to 30 consecutive nucleotides from the 5′ EST or extended cDNA. In some embodiments, the probe comprises at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the 5′ EST or extended cDNA.

Techniques for labeling the probe are well known and include phosphorylation with polynucleotide kinase, nick translation, in vitro transcription, and non-radioactive techniques. The cDNAs or genomic DNAs in the library are transferred to a nitrocellulose or nylon filter and denatured. After blocking of non-specific sites, the filter is incubated with the labeled probe for an amount of time sufficient to allow binding of the probe to cDNAs or genomic DNAs containing a sequence capable of hybridizing thereto.

By varying the stringency of the hybridization conditions used to identify extended cDNAs or genomic DNAs which hybridize to the detectable probe, extended cDNAs having different levels of homology to the probe can be identified and isolated as described below.

1. Identification of Extended cDNA or Genomic DNA Sequences Having a High Degree of Homology to the Labeled Probe

To identify extended cDNAs or genomic DNAs having a high degree of homology to the probe sequence, the melting temperature of the probe may be calculated using the following formulas:

For probes between 14 and 70 nucleotides in length the melting temperature (Tm) is calculated using the formula: Tm=81.5+16.6(log [Na+])+0.41(fraction G+C)−(600/N) where N is the length of the probe.

If the hybridization is carried out in a solution containing formamide, the melting temperature may be calculated using the equation Tm=81.5+16.6(log [Na+])+0.41(fraction G+C)−0.63(% formamide)−(600/N) where N is the length of the probe.
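The two formulas can be evaluated with a single function, sketched here for convenience. The function and parameter names are hypothetical. The sketch assumes that "log" denotes the base-10 logarithm of the molar sodium concentration, and it passes the G+C term through exactly as supplied by the caller, since the scale intended for that term should be taken from the cited reference rather than from this sketch.

```python
import math


def melting_temperature(sodium_molar, gc_term, probe_length, percent_formamide=0.0):
    """Evaluate Tm = 81.5 + 16.6 log[Na+] + 0.41 (G+C term) - 0.63 (% formamide) - 600/N.

    sodium_molar      - sodium ion concentration in mol/L (log taken as base 10)
    gc_term           - the G+C term of the formula, as supplied by the caller
    probe_length      - N, the probe length in nucleotides
    percent_formamide - 0 for the formula without formamide
    """
    return (81.5
            + 16.6 * math.log10(sodium_molar)
            + 0.41 * gc_term
            - 0.63 * percent_formamide
            - 600.0 / probe_length)


if __name__ == "__main__":
    # Illustrative inputs only: 1 M sodium, G+C term of 50, 20-nucleotide probe.
    print(melting_temperature(1.0, 50, 20))
    print(melting_temperature(1.0, 50, 20, percent_formamide=50))
```

With the formamide percentage left at its default of zero the function reduces to the first formula, so both cases are covered by the same expression.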

Prehybridization may be carried out in 6×SSC, 5× Denhardt's reagent,0.5% SDS, 100 μg denatured fragmented salmon sperm DNA or 6×SSC, 5×Denhardt's reagent, 0.5% SDS, 100 μg denatured fragmented salmon spermDNA, 50% formamide. The formulas for SSC and Denhardt's solutions arelisted in Sambrook et al., supra.

Hybridization is conducted by adding the detectable probe to the prehybridization solutions listed above. Where the probe comprises double stranded DNA, it is denatured before addition to the hybridization solution. The filter is contacted with the hybridization solution for a sufficient period of time to allow the probe to hybridize to extended cDNAs or genomic DNAs containing sequences complementary thereto or homologous thereto. For probes over 200 nucleotides in length, the hybridization may be carried out at 15-25° C. below the Tm. For shorter probes, such as oligonucleotide probes, the hybridization may be conducted at 15-25° C. below the Tm. Preferably, for hybridizations in 6×SSC, the hybridization is conducted at approximately 68° C. Preferably, for hybridizations in 50% formamide containing solutions, the hybridization is conducted at approximately 42° C.

All of the foregoing hybridizations would be considered to be under “stringent” conditions.

Following hybridization, the filter is washed in 2×SSC, 0.1% SDS at room temperature for 15 minutes. The filter is then washed with 0.1×SSC, 0.5% SDS at room temperature for 30 minutes to 1 hour. Thereafter, the filter is washed at the hybridization temperature in 0.1×SSC, 0.5% SDS. A final wash is conducted in 0.1×SSC at room temperature.

Extended cDNAs, nucleic acids homologous to extended cDNAs or 5′ ESTs, or genomic DNAs which have hybridized to the probe are identified by autoradiography or other conventional techniques.

2. Obtaining Extended cDNA or Genomic DNA Sequences Having Lower Degrees of Homology to the Labeled Probe

The above procedure may be modified to identify extended cDNAs, nucleic acids homologous to extended cDNAs, or genomic DNAs having decreasing levels of homology to the probe sequence. For example, to obtain extended cDNAs, nucleic acids homologous to extended cDNAs, or genomic DNAs of decreasing homology to the detectable probe, less stringent conditions may be used. For example, the hybridization temperature may be decreased in increments of 5° C. from 68° C. to 42° C. in a hybridization buffer having a sodium concentration of approximately 1 M. Following hybridization, the filter may be washed with 2×SSC, 0.5% SDS at the temperature of hybridization. These conditions are considered to be “moderate” conditions above 50° C. and “low” conditions below 50° C.

Alternatively, the hybridization may be carried out in buffers, such as 6×SSC, containing formamide at a temperature of 42° C. In this case, the concentration of formamide in the hybridization buffer may be reduced in 5% increments from 50% to 0% to identify clones having decreasing levels of homology to the probe. Following hybridization, the filter may be washed with 6×SSC, 0.5% SDS at 50° C. These conditions are considered to be “moderate” conditions above 25% formamide and “low” conditions below 25% formamide.
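The two reduced-stringency series described above (temperature stepped down from 68° C. to 42° C., or formamide stepped down from 50% to 0%) can be laid out with a minimal Python sketch. The helper names are hypothetical. Because 5° C. steps from 68° C. do not land exactly on 42° C., the sketch assumes that the final endpoint is simply appended. The text does not classify the boundary values themselves (50° C. and 25% formamide), so they are reported as "unspecified".

```python
def stepped_series(start, stop, step):
    """List values from start down to stop in fixed decrements.

    If the last step would overshoot, stop itself is appended so that
    both endpoints named in the text are included (an assumption).
    """
    if step <= 0 or start < stop:
        raise ValueError("expected start >= stop and a positive step")
    values = list(range(start, stop - 1, -step))
    if values[-1] != stop:
        values.append(stop)
    return values


def stringency_label(value, threshold):
    """Label a condition 'moderate' above the threshold and 'low' below it."""
    if value > threshold:
        return "moderate"
    if value < threshold:
        return "low"
    return "unspecified"  # the text does not classify the threshold itself


# Temperature series in approximately 1 M sodium: 68 C down to 42 C, threshold 50 C.
temperature_plan = [(t, stringency_label(t, 50)) for t in stepped_series(68, 42, 5)]

# Formamide series at 42 C: 50% down to 0%, threshold 25%.
formamide_plan = [(f, stringency_label(f, 25)) for f in stepped_series(50, 0, 5)]
```

Each entry pairs a hybridization condition with the label the text assigns to it, which may help when planning a graded series of filters.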

Extended cDNAs, nucleic acids homologous to extended cDNAs, or genomic DNAs which have hybridized to the probe are identified by autoradiography.

3. Determination of the Degree of Homology between the Obtained Extended cDNAs or Genomic DNAs and the Labeled Probe

To determine the level of homology between the hybridized nucleic acid and the extended cDNA or 5′ EST from which the probe was derived, the nucleotide sequences of the hybridized nucleic acid and the extended cDNA or 5′ EST from which the probe was derived are compared. The sequences of the extended cDNA or 5′ EST and the homologous sequences may be stored on a computer readable medium as described in Example 17 above and may be compared using any of a variety of algorithms familiar to those skilled in the art. For example, if it is desired to obtain nucleic acids homologous to extended cDNAs, such as allelic variants thereof or nucleic acids encoding proteins related to the proteins encoded by the extended cDNAs, the level of homology between the hybridized nucleic acid and the extended cDNA or 5′ EST used as the probe may be determined using algorithms such as BLAST2N; parameters may be adapted depending on the sequence length and degree of homology studied. For example, the default parameters or the parameters in Tables I and II may be used to determine homology levels.

Alternatively, the level of homology between the hybridized nucleic acid and the extended cDNA or 5′ EST from which the probe was derived may be determined using the FASTDB algorithm described in Brutlag et al. Comp. App. Biosci. 6:237-245, 1990. In such analyses the parameters may be selected as follows: Matrix=Unitary, k-tuple=4, Mismatch Penalty=1, Joining Penalty=30, Randomization Group Length=0, Cutoff Score=1, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the sequence which hybridizes to the probe, whichever is shorter. Because the FASTDB program does not consider 5′ or 3′ truncations when calculating homology levels, if the sequence which hybridizes to the probe is truncated relative to the sequence of the extended cDNA or 5′ EST from which the probe was derived, the homology level is manually adjusted by calculating the number of nucleotides of the extended cDNA or 5′ EST which are not matched or aligned with the hybridizing sequence, determining the percentage of the total nucleotides of the extended cDNA or 5′ EST which the non-matched or non-aligned nucleotides represent, and subtracting this percentage from the homology level. For example, if the hybridizing sequence is 700 nucleotides in length and the extended cDNA sequence is 1000 nucleotides in length wherein the first 300 bases at the 5′ end of the extended cDNA are absent from the hybridizing sequence, and wherein the overlapping 700 nucleotides are identical, the homology level would be adjusted as follows. The non-matched, non-aligned 300 bases represent 30% of the length of the extended cDNA. If the overlapping 700 nucleotides are 100% identical, the adjusted homology level would be 100−30=70% homology. It should be noted that the preceding adjustments are only made when the non-matched or non-aligned nucleotides are at the 5′ or 3′ ends. No adjustments are made if the non-matched or non-aligned sequences are internal or under any other conditions.
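The manual truncation correction described above reduces to a single subtraction, sketched here in Python. The function name and argument names are hypothetical. Following the worked example, the penalty is computed relative to the length of the extended cDNA or 5′ EST, and only terminal non-aligned positions are counted; the identity over the aligned region is assumed to come from the alignment program.

```python
def adjusted_homology(overlap_identity_pct, reference_length, unaligned_terminal):
    """Apply the manual terminal-truncation correction to a homology level.

    overlap_identity_pct: percent identity reported over the aligned region.
    reference_length: length of the extended cDNA or 5' EST (or of the
        amino acid sequence it encodes).
    unaligned_terminal: number of terminal reference positions that are
        not matched or aligned with the hybridizing sequence.
    Internal non-aligned positions are not corrected, as the text specifies.
    """
    if reference_length <= 0:
        raise ValueError("reference_length must be positive")
    if not 0 <= unaligned_terminal <= reference_length:
        raise ValueError("unaligned_terminal must lie within the reference")
    penalty_pct = 100.0 * unaligned_terminal / reference_length
    return overlap_identity_pct - penalty_pct
```

With the figures from the example above, adjusted_homology(100.0, 1000, 300) evaluates to 70.0. The same arithmetic applies when the lengths are counted in amino acid residues.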

For example, using the above methods, nucleic acids having at least 95% nucleic acid homology, at least 96% nucleic acid homology, at least 97% nucleic acid homology, at least 98% nucleic acid homology, at least 99% nucleic acid homology, or more than 99% nucleic acid homology to the extended cDNA or 5′ EST from which the probe was derived may be obtained and identified. Such nucleic acids may be allelic variants or related nucleic acids from other species. Similarly, by using progressively less stringent hybridization conditions one can obtain and identify nucleic acids having at least 90%, at least 85%, at least 80% or at least 75% homology to the extended cDNA or 5′ EST from which the probe was derived.

To determine whether a clone encodes a protein having a given amount of homology to the protein encoded by the extended cDNA or 5′ EST, the amino acid sequence encoded by the extended cDNA or 5′ EST is compared to the amino acid sequence encoded by the hybridizing nucleic acid. The sequences encoded by the extended cDNA or 5′ EST and the sequences encoded by the homologous sequences may be stored on a computer readable medium as described in Example 17 above and may be compared using any of a variety of algorithms familiar to those skilled in the art. Homology is determined to exist when an amino acid sequence in the extended cDNA or 5′ EST is closely related to an amino acid sequence in the hybridizing nucleic acid. A sequence is closely related when it is identical to that of the extended cDNA or 5′ EST or when it contains one or more amino acid substitutions therein in which amino acids having similar characteristics have been substituted for one another. Using the above methods and algorithms such as FASTA with parameters depending on the sequence length and degree of homology studied, for example the default parameters or the parameters in Tables I and II, one can obtain nucleic acids encoding proteins having at least 99%, at least 98%, at least 97%, at least 96%, at least 95%, at least 90%, at least 85%, at least 80% or at least 75% homology to the proteins encoded by the extended cDNA or 5′ EST from which the probe was derived. In some embodiments, the homology levels can be determined using the “default” opening penalty and the “default” gap penalty, and a scoring matrix such as PAM 250 (a standard scoring matrix; see Dayhoff et al., in: Atlas of Protein Sequence and Structure, Vol. 5, Supp. 3 (1978)).
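The notion of a "closely related" sequence given above (identical, or differing only by substitutions between amino acids having similar characteristics) can be illustrated with a simple Python sketch. The grouping of amino acids below is an assumption chosen for illustration and is not defined in the text; a scoring matrix such as PAM 250 would give a graded rather than yes/no answer. The sketch also assumes the two sequences are already aligned and of equal length.

```python
# Illustrative similarity groups (an assumption, not taken from the text).
SIMILAR_GROUPS = [
    "AVLIM",  # aliphatic / hydrophobic
    "FWY",    # aromatic
    "STNQ",   # polar, uncharged
    "KRH",    # basic
    "DE",     # acidic
]


def is_conservative(a, b):
    """True if residues a and b are identical or share a similarity group."""
    return a == b or any(a in group and b in group for group in SIMILAR_GROUPS)


def is_closely_related(reference, candidate):
    """True if every position is identical or conservatively substituted.

    Both arguments are one-letter amino acid strings of equal length,
    assumed to be pre-aligned without gaps.
    """
    reference, candidate = reference.upper(), candidate.upper()
    if len(reference) != len(candidate):
        return False
    return all(is_conservative(a, b) for a, b in zip(reference, candidate))
```

For instance, under this grouping the peptides MKLV and MRLI would be treated as closely related, whereas MKLV and MDLV would not, because lysine and aspartate fall in different groups.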

Alternatively, the level of homology may be determined using the FASTDB algorithm described by Brutlag et al. Comp. App. Biosci. 6:237-245, 1990. In such analyses the parameters may be selected as follows: Matrix=PAM 0, k-tuple=2, Mismatch Penalty=1, Joining Penalty=20, Randomization Group Length=0, Cutoff Score=1, Window Size=Sequence Length, Gap Penalty=5, Gap Size Penalty=0.05, Window Size=500 or the length of the homologous sequence, whichever is shorter. If the homologous amino acid sequence is shorter than the amino acid sequence encoded by the extended cDNA or 5′ EST as a result of an N terminal and/or C terminal deletion the results may be manually corrected as follows. First, the number of amino acid residues of the amino acid sequence encoded by the extended cDNA or 5′ EST which are not matched or aligned with the homologous sequence is determined. Then, the percentage of the length of the sequence encoded by the extended cDNA or 5′ EST which the non-matched or non-aligned amino acids represent is calculated. This percentage is subtracted from the homology level. For example, wherein the amino acid sequence encoded by the extended cDNA or 5′ EST is 100 amino acids in length and the length of the homologous sequence is 80 amino acids and wherein the homologous sequence is truncated at the N terminal end with respect to the amino acid sequence encoded by the extended cDNA or 5′ EST, the homology level is calculated as follows. In the preceding scenario there are 20 non-matched, non-aligned amino acids in the sequence encoded by the extended cDNA or 5′ EST. This represents 20% of the length of the amino acid sequence encoded by the extended cDNA or 5′ EST. If the remaining amino acids are 100% identical between the two sequences, the homology level would be 100%−20%=80% homology. No adjustments are made if the non-matched or non-aligned sequences are internal or under any other conditions.

In addition to the above described methods, other protocols are available to obtain extended cDNAs using 5′ ESTs as outlined in the following paragraphs.

Extended cDNAs may be prepared by obtaining mRNA from the tissue, cell, or organism of interest using mRNA preparation procedures utilizing polyA selection procedures or other techniques known to those skilled in the art. A first primer capable of hybridizing to the polyA tail of the mRNA is hybridized to the mRNA and a reverse transcription reaction is performed to generate a first cDNA strand.

The first cDNA strand is hybridized to a second primer containing at least 10 consecutive nucleotides of the sequences of the 5′ EST for which an extended cDNA is desired. Preferably, the primer comprises at least 12, 15, or 17 consecutive nucleotides from the sequences of the 5′ EST. More preferably, the primer comprises 20 to 30 consecutive nucleotides from the sequences of the 5′ EST. In some embodiments, the primer comprises more than 30 nucleotides from the sequences of the 5′ EST. If it is desired to obtain extended cDNAs containing the full protein coding sequence, including the authentic translation initiation site, the second primer used contains sequences located upstream of the translation initiation site. The second primer is extended to generate a second cDNA strand complementary to the first cDNA strand. Alternatively, RT-PCR may be performed as described above using primers from both ends of the cDNA to be obtained.

Extended cDNAs containing 5′ fragments of the mRNA may be prepared by hybridizing an mRNA comprising the sequence of the 5′ EST for which an extended cDNA is desired with a primer comprising at least 10 consecutive nucleotides of the sequences complementary to the 5′ EST and reverse transcribing the hybridized primer to make a first cDNA strand from the mRNAs. Preferably, the primer comprises at least 12, 15, or 17 consecutive nucleotides from the 5′ EST. More preferably, the primer comprises 20 to 30 consecutive nucleotides from the 5′ EST.

Thereafter, a second cDNA strand complementary to the first cDNA strand is synthesized. The second cDNA strand may be made by hybridizing a primer complementary to sequences in the first cDNA strand to the first cDNA strand and extending the primer to generate the second cDNA strand.

The double stranded extended cDNAs made using the methods described above are isolated and cloned. The extended cDNAs may be cloned into vectors such as plasmids or viral vectors capable of replicating in an appropriate host cell. For example, the host cell may be a bacterial, mammalian, avian, or insect cell.

Techniques for isolating mRNA, reverse transcribing a primer hybridized to mRNA to generate a first cDNA strand, extending a primer to make a second cDNA strand complementary to the first cDNA strand, isolating the double stranded cDNA and cloning the double stranded cDNA are well known to those skilled in the art and are described in Current Protocols in Molecular Biology, John Wiley & Sons, Inc. 1997 and Sambrook et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, 1989, the entire disclosures of which are incorporated herein by reference.

Alternatively, other procedures may be used for obtaining full length cDNAs or extended cDNAs. In one approach, full length or extended cDNAs are prepared from mRNA and cloned into double stranded phagemids as follows. The cDNA library in the double stranded phagemids is then rendered single stranded by treatment with an endonuclease, such as the Gene II product of the phage F1, and an exonuclease (Chang et al., Gene 127:95-8, 1993). A biotinylated oligonucleotide comprising the sequence of a 5′ EST, or a fragment containing at least 10 nucleotides thereof, is hybridized to the single stranded phagemids. Preferably, the fragment comprises at least 12, 15, or 17 consecutive nucleotides from the 5′ EST. More preferably, the fragment comprises 20-30 consecutive nucleotides from the 5′ EST. In some procedures, the fragment may comprise at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the 5′ EST.

Hybrids between the biotinylated oligonucleotide and phagemids having inserts containing the 5′ EST sequence are isolated by incubating the hybrids with streptavidin coated paramagnetic beads and retrieving the beads with a magnet (Fry et al., Biotechniques, 13: 124-131, 1992). Thereafter, the resulting phagemids containing the 5′ EST sequence are released from the beads and converted into double stranded DNA using a primer specific for the 5′ EST sequence. Alternatively, protocols such as the Gene Trapper kit (Gibco BRL) may be used. The resulting double stranded DNA is transformed into bacteria. Extended cDNAs containing the 5′ EST sequence are identified by colony PCR or colony hybridization.

Using any of the above described methods in section III, a plurality of extended cDNAs containing full length protein coding sequences or sequences encoding only the mature protein remaining after the signal peptide is cleaved off may be provided as cDNA libraries for subsequent evaluation of the encoded proteins or use in diagnostic assays as described below.

IV. Expression of Proteins Encoded by Extended cDNAs Isolated Using 5′ ESTs

Extended cDNAs containing the full protein coding sequences of their corresponding mRNAs or portions thereof, such as cDNAs encoding the mature protein, may be used to express the secreted proteins or portions thereof which they encode as described in Example 30 below. If desired, the extended cDNAs may contain the sequences encoding the signal peptide to facilitate secretion of the expressed protein. It will be appreciated that a plurality of extended cDNAs containing the full protein coding sequences or portions thereof may be simultaneously cloned into expression vectors to create an expression library for analysis of the encoded proteins as described below.

EXAMPLE 30 Expression of the Proteins Encoded by Extended cDNAs or Portions Thereof

To express the proteins encoded by the extended cDNAs or portions thereof, nucleic acids containing the coding sequence for the proteins or portions thereof to be expressed are obtained as described in Examples 27-29 and cloned into a suitable expression vector. If desired, the nucleic acids may contain the sequences encoding the signal peptide to facilitate secretion of the expressed protein. For example, the nucleic acid may comprise the sequence of one of SEQ ID NOs: 40-84 and 130-154 listed in Table IV and in the accompanying sequence listing. Alternatively, the nucleic acid may comprise those nucleotides which make up the full coding sequence of one of the sequences of SEQ ID NOs: 40-84 and 130-154 as defined in Table IV above.

It will be appreciated that should the extent of the full coding sequence (i.e. the sequence encoding the signal peptide and the mature protein resulting from cleavage of the signal peptide) differ from that listed in Table IV as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the full coding sequences in the sequences of SEQ ID NOs. 40-84 and 130-154. Accordingly, the scope of any claims herein relating to nucleic acids containing the full coding sequence of one of SEQ ID NOs. 40-84 and 130-154 is not to be construed as excluding any readily identifiable variations from or equivalents to the full coding sequences listed in Table IV. Similarly, should the extent of the full length polypeptides differ from those indicated in Table V as a result of any of the preceding factors, the scope of claims relating to polypeptides comprising the amino acid sequence of the full length polypeptides is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences listed in Table V.

Alternatively, the nucleic acid used to express the protein or portion thereof may comprise those nucleotides which encode the mature protein (i.e. the protein created by cleaving the signal peptide off) encoded by one of the sequences of SEQ ID NOs: 40-84 and 130-154 as defined in Table IV above.

It will be appreciated that should the extent of the sequence encoding the mature protein differ from that listed in Table IV as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the sequence encoding the mature protein in the sequences of SEQ ID NOs. 40-84 and 130-154. Accordingly, the scope of any claims herein relating to nucleic acids containing the sequence encoding the mature protein encoded by one of SEQ ID NOs. 40-84 and 130-154 is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences listed in Table IV. Thus, claims relating to nucleic acids containing the sequence encoding the mature protein encompass equivalents to the sequences listed in Table IV, such as sequences encoding biologically active proteins resulting from post-translational modification, enzymatic cleavage, or other readily identifiable variations from or equivalents to the secreted proteins in addition to cleavage of the signal peptide. Similarly, should the extent of the mature polypeptides differ from those indicated in Table V as a result of any of the preceding factors, the scope of claims relating to polypeptides comprising the sequence of a mature protein included in the sequence of one of SEQ ID NOs. 85-129 and 155-179 is not to be construed as excluding any readily identifiable variations from or equivalents to the sequences listed in Table V. Thus, claims relating to polypeptides comprising the sequence of the mature protein encompass equivalents to the sequences listed in Table IV, such as biologically active proteins resulting from post-translational modification, enzymatic cleavage, or other readily identifiable variations from or equivalents to the secreted proteins in addition to cleavage of the signal peptide.
It will also be appreciated that should the biologically active form of the polypeptides included in the sequence of one of SEQ ID NOs. 85-129 and 155-179 or the nucleic acids encoding the biologically active form of the polypeptides differ from those identified as the mature polypeptide in Table V or the nucleotides encoding the mature polypeptide in Table IV as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the amino acids in the biologically active form of the polypeptides and the nucleic acids encoding the biologically active form of the polypeptides. In such instances, the claims relating to polypeptides comprising the mature protein included in one of SEQ ID NOs. 85-129 and 155-179 or nucleic acids comprising the nucleotides of one of SEQ ID NOs. 40-84 and 130-154 encoding the mature protein shall not be construed to exclude any readily identifiable variations from the sequences listed in Table IV and Table V.

In some embodiments, the nucleic acid used to express the protein or portion thereof may comprise those nucleotides which encode the signal peptide encoded by one of the sequences of SEQ ID NOs: 40-84 and 130-154 as defined in Table IV above.

It will be appreciated that should the extent of the sequence encoding the signal peptide differ from that listed in Table IV as a result of a sequencing error, reverse transcription or amplification error, mRNA splicing, post-translational modification of the encoded protein, enzymatic cleavage of the encoded protein, or other biological factors, one skilled in the art would be readily able to identify the extent of the sequence encoding the signal peptide in the sequences of SEQ ID NOs. 40-84 and 130-154. Accordingly, the scope of any claims herein relating to nucleic acids containing the sequence encoding the signal peptide encoded by one of SEQ ID NOs. 40-84 and 130-154 is not to be construed as excluding any readily identifiable variations from the sequences listed in Table IV. Similarly, should the extent of the signal peptides differ from those indicated in Table V as a result of any of the preceding factors, the scope of claims relating to polypeptides comprising the sequence of a signal peptide included in the sequence of one of SEQ ID NOs. 85-129 and 155-179 is not to be construed as excluding any readily identifiable variations from the sequences listed in Table V.

Alternatively, the nucleic acid may encode a polypeptide comprising at least 10 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179. In some embodiments, the nucleic acid may encode a polypeptide comprising at least 15 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179. In other embodiments, the nucleic acid may encode a polypeptide comprising at least 25 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179. In other embodiments, the nucleic acid may encode a polypeptide comprising at least 60, at least 75, at least 100 or more than 100 consecutive amino acids of one of the sequences of SEQ ID NOs: 85-129 and 155-179.
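Whether a given polypeptide comprises at least a stated number of consecutive amino acids of a reference sequence can be checked by finding the longest run of residues the two sequences share. The following Python sketch uses a standard longest-common-substring dynamic program; the function names are hypothetical and the sequences in any example are made up for illustration rather than taken from the sequence listing.

```python
def longest_shared_run(a, b):
    """Length of the longest stretch of consecutive residues common to a and b."""
    best = 0
    previous = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        current = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the run that ended at the previous pair of positions.
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def comprises_consecutive(candidate, reference, minimum=10):
    """True if candidate contains at least `minimum` consecutive residues of reference."""
    return longest_shared_run(candidate.upper(), reference.upper()) >= minimum
```

The threshold can be set to 10, 15, 25, or any of the other lengths recited above.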

The nucleic acids inserted into the expression vectors may also contain sequences upstream of the sequences encoding the signal peptide, such as sequences which regulate expression levels or sequences which confer tissue specific expression.

The nucleic acid encoding the protein or polypeptide to be expressed is operably linked to a promoter in an expression vector using conventional cloning technology. The expression vector may be any of the mammalian, yeast, insect or bacterial expression systems known in the art. Commercially available vectors and expression systems are available from a variety of suppliers including Genetics Institute (Cambridge, Mass.), Stratagene (La Jolla, Calif.), Promega (Madison, Wis.), and Invitrogen (San Diego, Calif.). If desired, to enhance expression and facilitate proper protein folding, the codon context and codon pairing of the sequence may be optimized for the particular expression organism in which the expression vector is introduced, as explained by Hatfield et al., U.S. Pat. No. 5,082,767, incorporated herein by this reference.

The following is provided as one exemplary method to express the proteins encoded by the extended cDNAs corresponding to the 5′ ESTs or the nucleic acids described above. First, the methionine initiation codon for the gene and the poly A signal of the gene are identified. If the nucleic acid encoding the polypeptide to be expressed lacks a methionine to serve as the initiation site, an initiating methionine can be introduced next to the first codon of the nucleic acid using conventional techniques. Similarly, if the extended cDNA lacks a poly A signal, this sequence can be added to the construct by, for example, splicing out the poly A signal from pSG5 (Stratagene) using BglI and SalI restriction endonuclease enzymes and incorporating it into the mammalian expression vector pXT1 (Stratagene). pXT1 contains the LTRs and a portion of the gag gene from Moloney Murine Leukemia Virus. The position of the LTRs in the construct allows efficient stable transfection. The vector includes the Herpes Simplex Thymidine Kinase promoter and the selectable neomycin gene. The extended cDNA or portion thereof encoding the polypeptide to be expressed is obtained by PCR from the bacterial vector using oligonucleotide primers complementary to the extended cDNA or portion thereof and containing restriction endonuclease sequences for PstI incorporated into the 5′ primer and BglII at the 5′ end of the corresponding cDNA 3′ primer, taking care to ensure that the extended cDNA is positioned in frame with the poly A signal. The purified fragment obtained from the resulting PCR reaction is digested with PstI, blunt ended with an exonuclease, digested with BglII, purified and ligated to pXT1, now containing a poly A signal and digested with BglII.
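The first step of the exemplary method, identifying the methionine initiation codon and the poly A signal, can be sketched in Python as follows. This is only a naive illustration under stated assumptions: it takes the first ATG as the initiation codon, reads codons to the first in-frame stop codon, and looks for the canonical AATAAA hexamer. An actual cDNA may use a later ATG or a variant poly A signal, and the function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
POLY_A_SIGNAL = "AATAAA"  # canonical hexamer; variant signals are not searched


def find_open_reading_frame(cdna):
    """Return (start, stop) for the first ATG and its first in-frame stop codon.

    Indices are 0-based; stop is the index of the first base of the stop
    codon. Returns None if no ATG or no in-frame stop codon is found.
    """
    seq = cdna.upper()
    start = seq.find("ATG")
    if start < 0:
        return None
    for i in range(start, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return start, i
    return None


def find_poly_a_signal(cdna, search_from=0):
    """Return the 0-based index of the first AATAAA at or after search_from, or -1."""
    return cdna.upper().find(POLY_A_SIGNAL, search_from)
```

A return value of None from find_open_reading_frame, or of -1 from find_poly_a_signal, would correspond to the cases discussed above in which an initiating methionine or a poly A signal has to be supplied by the construct.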

The ligated product is transfected into mouse NIH 3T3 cells using Lipofectin (Life Technologies, Inc., Grand Island, N.Y.) under conditions outlined in the product specification. Positive transfectants are selected after growing the transfected cells in 600 μg/ml G418 (Sigma, St. Louis, Mo.). Preferably the expressed protein is released into the culture medium, thereby facilitating purification.

Alternatively, the extended cDNAs may be cloned into pED6dpc2 as described above. The resulting pED6dpc2 constructs may be transfected into a suitable host cell, such as COS 1 cells. Methotrexate resistant cells are selected and expanded. Preferably, the protein expressed from the extended cDNA is released into the culture medium, thereby facilitating purification.

Proteins in the culture medium are separated by gel electrophoresis. If desired, the proteins may be ammonium sulfate precipitated or separated based on size or charge prior to electrophoresis.

As a control, the expression vector lacking a cDNA insert is introduced into host cells or organisms and the proteins in the medium are harvested. The secreted proteins present in the medium are detected using techniques such as Coomassie or silver staining or using antibodies against the protein encoded by the extended cDNA. Coomassie and silver staining techniques are familiar to those skilled in the art.

Antibodies capable of specifically recognizing the protein of interest may be generated using synthetic 15-mer peptides having a sequence encoded by the appropriate 5′ EST, extended cDNA, or portion thereof. The synthetic peptides are injected into mice to generate antibodies to the polypeptide encoded by the 5′ EST, extended cDNA, or portion thereof.

Secreted proteins from the host cells or organisms containing an expression vector which contains the extended cDNA derived from a 5′ EST or a portion thereof are compared to those from the control cells or organism. The presence of a band in the medium from the cells containing the expression vector which is absent in the medium from the control cells indicates that the extended cDNA encodes a secreted protein. Generally, the band corresponding to the protein encoded by the extended cDNA will have a mobility near that expected based on the number of amino acids in the open reading frame of the extended cDNA. However, the band may have a mobility different than that expected as a result of modifications such as glycosylation, ubiquitination, or enzymatic cleavage.
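A rough expectation for the band position can be derived from the number of amino acids in the open reading frame. The Python sketch below assumes an average residue mass of about 110 daltons, a common rule of thumb that is not stated in the text, and optionally subtracts a cleaved signal peptide. The names are hypothetical, and the estimate ignores glycosylation and the other modifications mentioned above.

```python
AVERAGE_RESIDUE_MASS_DA = 110.0  # rule-of-thumb average; an assumption, not from the text


def expected_mass_kda(orf_amino_acids, signal_peptide_amino_acids=0):
    """Estimate the mass in kilodaltons of the protein encoded by an open reading frame.

    orf_amino_acids: number of amino acids encoded by the open reading frame.
    signal_peptide_amino_acids: residues removed by signal peptide cleavage,
        if the mature protein rather than the precursor is of interest.
    """
    if orf_amino_acids <= 0:
        raise ValueError("orf_amino_acids must be positive")
    if not 0 <= signal_peptide_amino_acids < orf_amino_acids:
        raise ValueError("signal peptide must be shorter than the open reading frame")
    mature_length = orf_amino_acids - signal_peptide_amino_acids
    return mature_length * AVERAGE_RESIDUE_MASS_DA / 1000.0
```

Under this approximation a 300 residue open reading frame would be expected to run near 33 kDa, or near 30.8 kDa after removal of a 20 residue signal peptide.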

Alternatively, if the protein expressed from the above expression vectors does not contain sequences directing its secretion, the proteins expressed from host cells containing an expression vector containing an insert encoding a secreted protein or portion thereof can be compared to the proteins expressed in host cells containing the expression vector without an insert. The presence of a band in samples from cells containing the expression vector with an insert which is absent in samples from cells containing the expression vector without an insert indicates that the desired protein or portion thereof is being expressed. Generally, the band will have the mobility expected for the secreted protein or portion thereof. However, the band may have a mobility different than that expected as a result of modifications such as glycosylation, ubiquitination, or enzymatic cleavage.

The protein encoded by the extended cDNA may be purified using standard immunochromatography techniques. In such procedures, a solution containing the secreted protein, such as the culture medium or a cell extract, is applied to a column having antibodies against the secreted protein attached to the chromatography matrix. The secreted protein is allowed to bind the immunochromatography column. Thereafter, the column is washed to remove non-specifically bound proteins. The specifically bound secreted protein is then released from the column and recovered using standard techniques.

If antibody production is not possible, the extended cDNA sequence or portion thereof may be incorporated into expression vectors designed for use in purification schemes employing chimeric polypeptides. In such strategies the coding sequence of the extended cDNA or portion thereof is inserted in frame with the gene encoding the other half of the chimera. The other half of the chimera may be β-globin or a nickel binding polypeptide encoding sequence. A chromatography matrix having antibody to β-globin or nickel attached thereto is then used to purify the chimeric protein. Protease cleavage sites may be engineered between the β-globin gene or the nickel binding polypeptide and the extended cDNA or portion thereof. Thus, the two polypeptides of the chimera may be separated from one another by protease digestion.

One useful expression vector for generating β-globin chimerics is pSG5 (Stratagene), which encodes rabbit β-globin. Intron II of the rabbit β-globin gene facilitates splicing of the expressed transcript, and the polyadenylation signal incorporated into the construct increases the level of expression. These techniques as described are well known to those skilled in the art of molecular biology. Standard methods are published in methods texts such as Davis et al., (Basic Methods in Molecular Biology, L. G. Davis, M. D. Dibner, and J. F. Battey, ed., Elsevier Press, NY, 1986) and many of the methods are available from Stratagene, Life Technologies, Inc., or Promega. Polypeptide may additionally be produced from the construct using in vitro translation systems such as the IN VITRO EXPRESS™ Translation Kit (Stratagene).

Following expression and purification of the secreted proteins encoded by the 5′ ESTs, extended cDNAs, or fragments thereof, the purified proteins may be tested for the ability to bind to the surface of various cell types as described in Example 31 below. It will be appreciated that a plurality of proteins expressed from these cDNAs may be included in a panel of proteins to be simultaneously evaluated for the activities specifically described below, as well as other biological roles for which assays for determining activity are available.

EXAMPLE 31 Analysis of Secreted Proteins to Determine Whether They Bind to the Cell Surface

The proteins encoded by the 5′ ESTs, extended cDNAs, or fragments thereof are cloned into expression vectors such as those described in Example 30. The proteins are purified by size, charge, immunochromatography or other techniques familiar to those skilled in the art. Following purification, the proteins are labeled using techniques known to those skilled in the art. The labeled proteins are incubated with cells or cell lines derived from a variety of organs or tissues to allow the proteins to bind to any receptor present on the cell surface. Following the incubation, the cells are washed to remove non-specifically bound protein. The labeled proteins are detected by autoradiography. Alternatively, unlabeled proteins may be incubated with the cells and detected with antibodies having a detectable label, such as a fluorescent molecule, attached thereto.

Specificity of cell surface binding may be analyzed by conducting a competition analysis in which various amounts of unlabeled protein are incubated along with the labeled protein. The amount of labeled protein bound to the cell surface decreases as the amount of competitive unlabeled protein increases. As a control, various amounts of an unlabeled protein unrelated to the labeled protein are included in some binding reactions. The amount of labeled protein bound to the cell surface does not decrease in binding reactions containing increasing amounts of unrelated unlabeled protein, indicating that the protein encoded by the cDNA binds specifically to the cell surface.
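The logic of the competition analysis can be illustrated with a short sketch. It is offered only as an illustration: the function names, the normalization to the reading taken without competitor, and the two cut-off percentages are assumptions introduced here and are not part of the procedure described above.

```python
from typing import List, Sequence


def percent_of_max_binding(bound_signal: Sequence[float]) -> List[float]:
    """Express each reading as a percentage of the first reading.

    The first entry is assumed to be the reaction with no unlabeled
    competitor added; later entries use increasing competitor amounts.
    """
    baseline = bound_signal[0]
    if baseline <= 0:
        raise ValueError("the no-competitor reading must be positive")
    return [100.0 * s / baseline for s in bound_signal]


def is_specific_binding(
    with_competitor: Sequence[float],
    with_unrelated: Sequence[float],
    min_drop: float = 50.0,          # hypothetical cut-off, in percent
    max_control_drop: float = 20.0,  # hypothetical cut-off, in percent
) -> bool:
    """Return True when the pattern matches specific binding.

    Specific binding is taken to mean that the bound labeled protein falls
    at the highest competitor amount, while the unrelated unlabeled protein
    never lowers the bound signal by more than max_control_drop.
    """
    competed = percent_of_max_binding(with_competitor)
    control = percent_of_max_binding(with_unrelated)
    specific_drop = 100.0 - competed[-1]
    control_drop = 100.0 - min(control)
    return specific_drop >= min_drop and control_drop <= max_control_drop
```

In use, the two sequences would hold the bound-label readings (for example, counts from autoradiography) from the reactions containing the homologous unlabeled protein and the unrelated unlabeled protein, respectively, ordered from no competitor to the highest amount.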

As discussed above, secreted proteins have been shown to have a number of important physiological effects and, consequently, represent a valuable therapeutic resource. The secreted proteins encoded by the extended cDNAs or portions thereof made according to Examples 27-29 may be evaluated to determine their physiological activities as described below.

EXAMPLE 32 Assaying the Proteins Expressed from Extended cDNAs or Portions Thereof for Cytokine, Cell Proliferation or Cell Differentiation Activity

As discussed above, secreted proteins may act as cytokines or may affect cellular proliferation or differentiation. Many protein factors discovered to date, including all known cytokines, have exhibited activity in one or more factor dependent cell proliferation assays, and hence the assays serve as a convenient confirmation of cytokine activity. The activity of a protein of the present invention is evidenced by any one of a number of routine factor dependent cell proliferation assays for cell lines including, without limitation, 32D, DA2, DA1G, T10, B9, B9/11, BaF3, MC9/G, M+ (preB M+), 2E8, RB5, DA1, 123, T1165, HT2, CTLL2, TF-1, Mo7c and CMK. The proteins encoded by the above extended cDNAs or portions thereof may be evaluated for their ability to regulate T cell or thymocyte proliferation in assays such as those described above or in the following references, which are incorporated herein by reference: Current Protocols in Immunology, Ed. by J. E. Coligan et al., Greene Publishing Associates and Wiley-Interscience; Takai et al., J. Immunol. 137:3494-3500, 1986; Bertagnolli et al., J. Immunol. 145:1706-1712, 1990; Bertagnolli et al., Cellular Immunology 133:327-341, 1991; Bertagnolli et al., J. Immunol. 149:3778-3783, 1992; Bowman et al., J. Immunol. 152:1756-1761, 1994.
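As an illustration of how results from a factor dependent cell proliferation assay might be summarized, the following sketch computes a stimulation index, that is, the mean readout of wells receiving the test protein divided by the mean readout of wells receiving none. The function names and the use of a simple ratio of means are assumptions made here for illustration; the text above does not prescribe a readout or a method of analysis.

```python
from statistics import mean
from typing import Dict, List, Sequence, Tuple


def stimulation_index(treated: Sequence[float], untreated: Sequence[float]) -> float:
    """Ratio of the mean treated readout to the mean untreated readout."""
    base = mean(untreated)
    if base <= 0:
        raise ValueError("mean untreated readout must be positive")
    return mean(treated) / base


def dose_response(
    readouts_by_dose: Dict[float, Sequence[float]],
    untreated: Sequence[float],
) -> List[Tuple[float, float]]:
    """Stimulation index at each protein dose, ordered by increasing dose."""
    return [
        (dose, stimulation_index(readouts_by_dose[dose], untreated))
        for dose in sorted(readouts_by_dose)
    ]
```

An index that rises with dose above 1 would be consistent with proliferative activity on the cell line tested; where to place a cut-off is left to the practitioner.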

In addition, numerous assays for cytokine production and/or the proliferation of spleen cells, lymph node cells and thymocytes are known. These include the techniques disclosed in Current Protocols in Immunology, J. E. Coligan et al. Eds., Vol 1 pp. 3.12.1-3.12.14, John Wiley and Sons, Toronto, 1994; and Schreiber, R. D., Current Protocols in Immunology, supra, Vol 1 pp. 6.8.1-6.8.8, John Wiley and Sons, Toronto, 1994.

The proteins encoded by the cDNAs may also be assayed for the ability to regulate the proliferation and differentiation of hematopoietic or lymphopoietic cells. Many assays for such activity are familiar to those skilled in the art, including the assays in the following references, which are incorporated herein by reference: Bottomly, K., Davis, L. S. and Lipsky, P. E., Measurement of Human and Murine Interleukin 2 and Interleukin 4, Current Protocols in Immunology, J. E. Coligan et al. Eds., Vol 1 pp. 6.3.1-6.3.12, John Wiley and Sons, Toronto, 1991; deVries et al., J. Exp. Med. 173:1205-1211, 1991; Moreau et al., Nature 336:690-692, 1988; Greenberger et al., Proc. Natl. Acad. Sci. U.S.A. 80:2931-2938, 1983; Nordan, R., Measurement of Mouse and Human Interleukin 6, Current Protocols in Immunology, J. E. Coligan et al. Eds., Vol 1 pp. 6.6.1-6.6.5, John Wiley and Sons, Toronto, 1991; Smith et al., Proc. Natl. Acad. Sci. U.S.A. 83:1857-1861, 1986; Bennett, F., Giannotti, J., Clark, S. C. and Turner, K. J., Measurement of Human Interleukin 11, Current Protocols in Immunology, J. E. Coligan et al. Eds., Vol 1 pp. 6.15.1, John Wiley and Sons, Toronto, 1991; Ciarletta, A., Giannotti, J., Clark, S. C. and Turner, K. J., Measurement of Mouse and Human Interleukin 9, Current Protocols in Immunology, J. E. Coligan et al. Eds., Vol 1 pp. 6.13.1, John Wiley and Sons, Toronto, 1991.

The proteins encoded by the cDNAs may also be assayed for their ability to regulate T-cell responses to antigens. Many assays for such activity are familiar to those skilled in the art, including the assays described in the following references, which are incorporated herein by reference: Chapter 3 (In Vitro Assays for Mouse Lymphocyte Function), Chapter 6 (Cytokines and Their Cellular Receptors) and Chapter 7 (Immunologic Studies in Humans) in Current Protocols in Immunology, J. E. Coligan et al. Eds., Greene Publishing Associates and Wiley-Interscience; Weinberger et al., Proc. Natl. Acad. Sci. USA 77:6091-6095, 1980; Weinberger et al., Eur. J. Immun. 11:405-411, 1981; Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988.

Those proteins which exhibit cytokine, cell proliferation, or cell differentiation activity may then be formulated as pharmaceuticals and used to treat clinical conditions in which induction of cell proliferation or differentiation is beneficial. Alternatively, as described in more detail below, genes encoding these proteins or nucleic acids regulating the expression of these proteins may be introduced into appropriate host cells to increase or decrease the expression of the proteins as desired.

EXAMPLE 33 Assaying the Proteins Expressed from Extended cDNAs or Portions Thereof for Activity as Immune System Regulators

The proteins encoded by the cDNAs may also be evaluated for their effects as immune regulators. For example, the proteins may be evaluated for their ability to influence thymocyte or splenocyte cytotoxicity. Numerous assays for such activity are familiar to those skilled in the art, including the assays described in the following references, which are incorporated herein by reference: Chapter 3 (In Vitro Assays for Mouse Lymphocyte Function 3.1-3.19) and Chapter 7 (Immunologic Studies in Humans) in Current Protocols in Immunology, J. E. Coligan et al. Eds., Greene Publishing Associates and Wiley-Interscience; Herrmann et al., Proc. Natl. Acad. Sci. USA 78:2488-2492, 1981; Herrmann et al., J. Immunol. 128:1968-1974, 1982; Handa et al., J. Immunol. 135:1564-1572, 1985; Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988; Bowman et al., J. Virology 61:1992-1998; Bertagnolli et al., Cellular Immunology 133:327-341, 1991; Brown et al., J. Immunol. 153:3079-3092, 1994.

The proteins encoded by the cDNAs may also be evaluated for their effects on T-cell dependent immunoglobulin responses and isotype switching. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Maliszewski, J. Immunol. 144:3028-3033, 1990; Mond, J. J. and Brunswick, M., Assays for B Cell Function: In Vitro Antibody Production, Vol 1 pp. 3.8.1-3.8.16 in Current Protocols in Immunology, J. E. Coligan et al. Eds., John Wiley and Sons, Toronto, 1994.

The proteins encoded by the cDNAs may also be evaluated for their effect on immune effector cells, including their effect on Th1 cells and cytotoxic lymphocytes. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Chapter 3 (In Vitro Assays for Mouse Lymphocyte Function 3.1-3.19) and Chapter 7 (Immunologic Studies in Humans) in Current Protocols in Immunology, J. E. Coligan et al. Eds., Greene Publishing Associates and Wiley-Interscience; Takai et al., J. Immunol. 137:3494-3500, 1986; Takai et al., J. Immunol. 140:508-512, 1988; Bertagnolli et al., J. Immunol. 149:3778-3783, 1992.

The proteins encoded by the cDNAs may also be evaluated for their effect on dendritic cell mediated activation of naive T-cells. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Guery et al., J. Immunol. 134:536-544, 1995; Inaba et al., Journal of Experimental Medicine 173:549-559, 1991; Macatonia et al., Journal of Immunology 154:5071-5079, 1995; Porgador et al., Journal of Experimental Medicine 182:255-260, 1995; Nair et al., Journal of Virology 67:4062-4069, 1993; Huang et al., Science 264:961-965, 1994; Macatonia et al., Journal of Experimental Medicine 169:1255-1264, 1989; Bhardwaj et al., Journal of Clinical Investigation 94:797-807, 1994; and Inaba et al., Journal of Experimental Medicine 172:631-640, 1990.

The proteins encoded by the cDNAs may also be evaluated for their influence on the lifetime of lymphocytes. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Darzynkiewicz et al., Cytometry 13:795-808, 1992; Gorczyca et al., Leukemia 7:659-670, 1993; Gorczyca et al., Cancer Research 53:1945-1951, 1993; Itoh et al., Cell 66:233-243, 1991; Zacharchuk, Journal of Immunology 145:4037-4045, 1990; Zamai et al., Cytometry 14:891-897, 1993; Gorczyca et al., International Journal of Oncology 1:639-648, 1992.

Assays for proteins that influence early steps of T-cell commitment and development include, without limitation, those described in: Antica et al., Blood 84:111-117, 1994; Fine et al., Cellular Immunology 155:111-122, 1994; Galy et al., Blood 85:2770-2778, 1995; Toki et al., Proc. Nat. Acad. Sci. USA 88:7548-7551, 1991.

Those proteins which exhibit activity as immune system regulators may then be formulated as pharmaceuticals and used to treat clinical conditions in which regulation of immune activity is beneficial. For example, the protein may be useful in the treatment of various immune deficiencies and disorders (including severe combined immunodeficiency (SCID)), e.g., in regulating (up or down) growth and proliferation of T and/or B lymphocytes, as well as affecting the cytolytic activity of NK cells and other cell populations. These immune deficiencies may be genetic or be caused by viral (e.g., HIV) as well as bacterial or fungal infections, or may result from autoimmune disorders. More specifically, infectious diseases caused by viral, bacterial, fungal or other infection may be treatable using a protein of the present invention, including infections by HIV, hepatitis viruses, herpesviruses, mycobacteria, Leishmania spp., malaria spp. and various fungal infections such as candidiasis. Of course, in this regard, a protein of the present invention may also be useful where a boost to the immune system generally may be desirable, i.e., in the treatment of cancer.

Autoimmune disorders which may be treated using a protein of the present invention include, for example, connective tissue disease, multiple sclerosis, systemic lupus erythematosus, rheumatoid arthritis, autoimmune pulmonary inflammation, Guillain-Barré syndrome, autoimmune thyroiditis, insulin dependent diabetes mellitus, myasthenia gravis, graft-versus-host disease and autoimmune inflammatory eye disease. Such a protein of the present invention may also be useful in the treatment of allergic reactions and conditions, such as asthma (particularly allergic asthma) or other respiratory problems. Other conditions in which immune suppression is desired (including, for example, organ transplantation) may also be treatable using a protein of the present invention.

Using the proteins of the invention it may also be possible to regulate immune responses in a number of ways. Down regulation may be in the form of inhibiting or blocking an immune response already in progress or may involve preventing the induction of an immune response. The functions of activated T-cells may be inhibited by suppressing T cell responses or by inducing specific tolerance in T cells, or both. Immunosuppression of T cell responses is generally an active, non-antigen-specific process which requires continuous exposure of the T cells to the suppressive agent. Tolerance, which involves inducing non-responsiveness or anergy in T cells, is distinguishable from immunosuppression in that it is generally antigen-specific and persists after exposure to the tolerizing agent has ceased. Operationally, tolerance can be demonstrated by the lack of a T cell response upon reexposure to specific antigen in the absence of the tolerizing agent.

Down regulating or preventing one or more antigen functions (including without limitation B lymphocyte antigen functions (such as, for example, B7)), e.g., preventing high level lymphokine synthesis by activated T cells, will be useful in situations of tissue, skin and organ transplantation and in graft-versus-host disease (GVHD). For example, blockage of T cell function should result in reduced tissue destruction in tissue transplantation. Typically, in tissue transplants, rejection of the transplant is initiated through its recognition as foreign by T cells, followed by an immune reaction that destroys the transplant. The administration of a molecule which inhibits or blocks interaction of a B7 lymphocyte antigen with its natural ligand(s) on immune cells (such as a soluble, monomeric form of a peptide having B7-2 activity alone or in conjunction with a monomeric form of a peptide having an activity of another B lymphocyte antigen (e.g., B7-1, B7-3) or blocking antibody), prior to transplantation can lead to the binding of the molecule to the natural ligand(s) on the immune cells without transmitting the corresponding costimulatory signal. Blocking B lymphocyte antigen function in this manner prevents cytokine synthesis by immune cells, such as T cells, and thus acts as an immunosuppressant. Moreover, the lack of costimulation may also be sufficient to anergize the T cells, thereby inducing tolerance in a subject. Induction of long-term tolerance by B lymphocyte antigen-blocking reagents may avoid the necessity of repeated administration of these blocking reagents. To achieve sufficient immunosuppression or tolerance in a subject, it may also be necessary to block the function of a combination of B lymphocyte antigens.

The efficacy of particular blocking reagents in preventing organ transplant rejection or GVHD can be assessed using animal models that are predictive of efficacy in humans. Examples of appropriate systems which can be used include allogeneic cardiac grafts in rats and xenogeneic pancreatic islet cell grafts in mice, both of which have been used to examine the immunosuppressive effects of CTLA4Ig fusion proteins in vivo as described in Lenschow et al., Science 257:789-792 (1992) and Turka et al., Proc. Natl. Acad. Sci. USA 89:11102-11105 (1992). In addition, murine models of GVHD (see Paul ed., Fundamental Immunology, Raven Press, New York, 1989, pp. 846-847) can be used to determine the effect of blocking B lymphocyte antigen function in vivo on the development of that disease.

Blocking antigen function may also be therapeutically useful for treating autoimmune diseases. Many autoimmune disorders are the result of inappropriate activation of T cells that are reactive against self tissue and which promote the production of cytokines and autoantibodies involved in the pathology of the diseases. Preventing the activation of autoreactive T cells may reduce or eliminate disease symptoms. Administration of reagents which block costimulation of T cells by disrupting receptor ligand interactions of B lymphocyte antigens can be used to inhibit T cell activation and prevent production of autoantibodies or T cell-derived cytokines which may be involved in the disease process. Additionally, blocking reagents may induce antigen-specific tolerance of autoreactive T cells which could lead to long-term relief from the disease. The efficacy of blocking reagents in preventing or alleviating autoimmune disorders can be determined using a number of well-characterized animal models of human autoimmune diseases. Examples include murine experimental autoimmune encephalitis, systemic lupus erythematosus in MRL/lpr/lpr mice or NZB hybrid mice, murine autoimmune collagen arthritis, diabetes mellitus in NOD mice and BB rats, and murine experimental myasthenia gravis (see Paul ed., Fundamental Immunology, Raven Press, New York, 1989, pp. 840-856).

Upregulation of an antigen function (preferably a B lymphocyte antigen function), as a means of upregulating immune responses, may also be useful in therapy. Upregulation of immune responses may be in the form of enhancing an existing immune response or eliciting an initial immune response. For example, enhancing an immune response through stimulating B lymphocyte antigen function may be useful in cases of viral infection. In addition, systemic viral diseases such as influenza, the common cold, and encephalitis might be alleviated by the systemic administration of a stimulatory form of B lymphocyte antigens.

Alternatively, anti-viral immune responses may be enhanced in an infected patient by removing T cells from the patient, costimulating the T cells in vitro with viral antigen-pulsed APCs either expressing a peptide of the present invention or together with a stimulatory form of a soluble peptide of the present invention, and reintroducing the in vitro activated T cells into the patient. The infected cells would now be capable of delivering a costimulatory signal to T cells in vivo, thereby activating the T cells.

In another application, upregulation or enhancement of antigen function (preferably B lymphocyte antigen function) may be useful in the induction of tumor immunity. Tumor cells (e.g., sarcoma, melanoma, lymphoma, leukemia, neuroblastoma, carcinoma) transfected with a nucleic acid encoding at least one peptide of the present invention can be administered to a subject to overcome tumor-specific tolerance in the subject. If desired, the tumor cell can be transfected to express a combination of peptides. For example, tumor cells obtained from a patient can be transfected ex vivo with an expression vector directing the expression of a peptide having B7-2-like activity alone, or in conjunction with a peptide having B7-1-like activity and/or B7-3-like activity. The transfected tumor cells are returned to the patient to result in expression of the peptides on the surface of the transfected cell. Alternatively, gene therapy techniques can be used to target a tumor cell for transfection in vivo.

The presence of the peptide of the present invention having the activity of a B lymphocyte antigen(s) on the surface of the tumor cell provides the necessary costimulation signal to T cells to induce a T cell mediated immune response against the transfected tumor cells. In addition, tumor cells which lack MHC class I or MHC class II molecules, or which fail to reexpress sufficient amounts of MHC class I or MHC class II molecules, can be transfected with nucleic acids encoding all or a portion (e.g., a cytoplasmic-domain truncated portion) of an MHC class I α chain protein and β₂ microglobulin protein or an MHC class II α chain protein and an MHC class II β chain protein to thereby express MHC class I or MHC class II proteins on the cell surface. Expression of the appropriate class I or class II MHC in conjunction with a peptide having the activity of a B lymphocyte antigen (e.g., B7-1, B7-2, B7-3) induces a T cell mediated immune response against the transfected tumor cell. Optionally, a gene encoding an antisense construct which blocks expression of an MHC class II associated protein, such as the invariant chain, can also be cotransfected with a DNA encoding a peptide having the activity of a B lymphocyte antigen to promote presentation of tumor associated antigens and induce tumor specific immunity. Thus, the induction of a T cell mediated immune response in a human subject may be sufficient to overcome tumor-specific tolerance in the subject. Alternatively, as described in more detail below, genes encoding these proteins or nucleic acids regulating the expression of these proteins may be introduced into appropriate host cells to increase or decrease the expression of the proteins as desired.

EXAMPLE 34 Assaying the Proteins Expressed from Extended cDNAs or Portions Thereof for Hematopoiesis Regulating Activity

The proteins encoded by the extended cDNAs or portions thereof may also be evaluated for their hematopoiesis regulating activity. For example, the effect of the proteins on embryonic stem cell differentiation may be evaluated. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Johansson et al., Molecular and Cellular Biology 15:141-151, 1995; Keller et al., Molecular and Cellular Biology 13:473-486, 1993; McClanahan et al., Blood 81:2903-2915, 1993.

The proteins encoded by the extended cDNAs or portions thereof may also be evaluated for their influence on the lifetime of stem cells and stem cell differentiation. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Freshney, M. G., Methylcellulose Colony Forming Assays, in Culture of Hematopoietic Cells, R. I. Freshney, et al. Eds., pp. 265-268, Wiley-Liss, Inc., New York, N.Y., 1994; Hirayama et al., Proc. Natl. Acad. Sci. USA 89:5907-5911, 1992; McNiece, I. K. and Briddell, R. A., Primitive Hematopoietic Colony Forming Cells with High Proliferative Potential, in Culture of Hematopoietic Cells, R. I. Freshney, et al. Eds., Vol pp. 23-39, Wiley-Liss, Inc., New York, N.Y., 1994; Neben et al., Experimental Hematology 22:353-359, 1994; Ploemacher, R. E., Cobblestone Area Forming Cell Assay, in Culture of Hematopoietic Cells, R. I. Freshney, et al. Eds., pp. 1-21, Wiley-Liss, Inc., New York, N.Y., 1994; Spooncer, E., Dexter, M. and Allen, T., Long Term Bone Marrow Cultures in the Presence of Stromal Cells, in Culture of Hematopoietic Cells, R. I. Freshney, et al. Eds., pp. 163-179, Wiley-Liss, Inc., New York, N.Y., 1994; and Sutherland, H. J., Long Term Culture Initiating Cell Assay, in Culture of Hematopoietic Cells, R. I. Freshney, et al. Eds., pp. 139-162, Wiley-Liss, Inc., New York, N.Y., 1994.

Those proteins which exhibit hematopoiesis regulatory activity may then be formulated as pharmaceuticals and used to treat clinical conditions in which regulation of hematopoiesis is beneficial. For example, a protein of the present invention may be useful in regulation of hematopoiesis and, consequently, in the treatment of myeloid or lymphoid cell deficiencies. Even marginal biological activity in support of colony forming cells or of factor-dependent cell lines indicates involvement in regulating hematopoiesis, e.g., in supporting the growth and proliferation of erythroid progenitor cells alone or in combination with other cytokines, thereby indicating utility, for example, in treating various anemias or for use in conjunction with irradiation/chemotherapy to stimulate the production of erythroid precursors and/or erythroid cells; in supporting the growth and proliferation of myeloid cells such as granulocytes and monocytes/macrophages (i.e., traditional CSF activity) useful, for example, in conjunction with chemotherapy to prevent or treat consequent myelo-suppression; in supporting the growth and proliferation of megakaryocytes and consequently of platelets, thereby allowing prevention or treatment of various platelet disorders such as thrombocytopenia, and generally for use in place of or complementary to platelet transfusions; and/or in supporting the growth and proliferation of hematopoietic stem cells which are capable of maturing to any and all of the above-mentioned hematopoietic cells and therefore find therapeutic utility in various stem cell disorders (such as those usually treated with transplantation, including, without limitation, aplastic anemia and paroxysmal nocturnal hemoglobinuria), as well as in repopulating the stem cell compartment post irradiation/chemotherapy, either in vivo or ex vivo (i.e., in conjunction with bone marrow transplantation or with peripheral progenitor cell transplantation (homologous or heterologous)) as normal cells or genetically manipulated for gene therapy. Alternatively, as described in more detail below, genes encoding these proteins or nucleic acids regulating the expression of these proteins may be introduced into appropriate host cells to increase or decrease the expression of the proteins as desired.

EXAMPLE 35 Assaying the Proteins Expressed from Extended cDNAs or Portions Thereof for Regulation of Tissue Growth

The proteins encoded by the extended cDNAs or portions thereof may also be evaluated for their effect on tissue growth. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in International Patent Publication No. WO95/16035, International Patent Publication No. WO95/05846 and International Patent Publication No. WO91/07491, which are incorporated herein by reference.

Assays for wound healing activity include, without limitation, those described in: Winter, Epidermal Wound Healing, pp. 71-112 (Maibach, H. I. and Rovee, D. T., eds.), Year Book Medical Publishers, Inc., Chicago, as modified by Eaglstein and Mertz, J. Invest. Dermatol. 71:382-84 (1978), which are incorporated herein by reference.

Those proteins which are involved in the regulation of tissue growth may then be formulated as pharmaceuticals and used to treat clinical conditions in which regulation of tissue growth is beneficial. For example, a protein of the present invention also may have utility in compositions used for bone, cartilage, tendon, ligament and/or nerve tissue growth or regeneration, as well as for wound healing and tissue repair and replacement, and in the treatment of burns, incisions and ulcers.

A protein of the present invention, which induces cartilage and/or bone growth in circumstances where bone is not normally formed, has application in the healing of bone fractures and cartilage damage or defects in humans and other animals. Such a preparation employing a protein of the invention may have prophylactic use in closed as well as open fracture reduction and also in the improved fixation of artificial joints. De novo bone formation induced by an osteogenic agent contributes to the repair of congenital, trauma induced, or oncologic resection induced craniofacial defects, and also is useful in cosmetic plastic surgery.

A protein of this invention may also be used in the treatment of periodontal disease, and in other tooth repair processes. Such agents may provide an environment to attract bone-forming cells, stimulate growth of bone-forming cells or induce differentiation of progenitors of bone-forming cells. A protein of the invention may also be useful in the treatment of osteoporosis or osteoarthritis, such as through stimulation of bone and/or cartilage repair or by blocking inflammation or processes of tissue destruction (collagenase activity, osteoclast activity, etc.) mediated by inflammatory processes.

Another category of tissue regeneration activity that may be attributable to the protein of the present invention is tendon/ligament formation. A protein of the present invention, which induces tendon/ligament-like tissue or other tissue formation in circumstances where such tissue is not normally formed, has application in the healing of tendon or ligament tears, deformities and other tendon or ligament defects in humans and other animals. Such a preparation employing a tendon/ligament-like tissue inducing protein may have prophylactic use in preventing damage to tendon or ligament tissue, as well as use in the improved fixation of tendon or ligament to bone or other tissues, and in repairing defects to tendon or ligament tissue. De novo tendon/ligament-like tissue formation induced by a composition of the present invention contributes to the repair of congenital, trauma induced, or other tendon or ligament defects of other origin, and is also useful in cosmetic plastic surgery for attachment or repair of tendons or ligaments. The compositions of the present invention may provide an environment to attract tendon- or ligament-forming cells, stimulate growth of tendon- or ligament-forming cells, induce differentiation of progenitors of tendon- or ligament-forming cells, or induce growth of tendon/ligament cells or progenitors ex vivo for return in vivo to effect tissue repair. The compositions of the invention may also be useful in the treatment of tendinitis, carpal tunnel syndrome and other tendon or ligament defects. The compositions may also include an appropriate matrix and/or sequestering agent as a carrier as is well known in the art.

The protein of the present invention may also be useful for proliferation of neural cells and for regeneration of nerve and brain tissue, i.e., for the treatment of central and peripheral nervous system diseases and neuropathies, as well as mechanical and traumatic disorders, which involve degeneration, death or trauma to neural cells or nerve tissue. More specifically, a protein may be used in the treatment of diseases of the peripheral nervous system, such as peripheral nerve injuries, peripheral neuropathy and localized neuropathies, and central nervous system diseases, such as Alzheimer's, Parkinson's disease, Huntington's disease, amyotrophic lateral sclerosis, and Shy-Drager syndrome. Further conditions which may be treated in accordance with the present invention include mechanical and traumatic disorders, such as spinal cord disorders, head trauma and cerebrovascular diseases such as stroke. Peripheral neuropathies resulting from chemotherapy or other medical therapies may also be treatable using a protein of the invention.

Proteins of the invention may also be useful to promote better or faster closure of non-healing wounds, including without limitation pressure ulcers, ulcers associated with vascular insufficiency, surgical and traumatic wounds, and the like.

It is expected that a protein of the present invention may also exhibit activity for generation or regeneration of other tissues, such as organs (including, for example, pancreas, liver, intestine, kidney, skin, endothelium), muscle (smooth, skeletal or cardiac) and vascular (including vascular endothelium) tissue, or for promoting the growth of cells comprising such tissues. Part of the desired effects may be by inhibition or modulation of fibrotic scarring to allow normal tissue to generate. A protein of the invention may also exhibit angiogenic activity.

A protein of the present invention may also be useful for gut protection or regeneration and treatment of lung or liver fibrosis, reperfusion injury in various tissues, and conditions resulting from systemic cytokine damage.

A protein of the present invention may also be useful for promoting orinhibiting differentiation of tissues described above from precursortissues or cells; or for inhibiting the growth of tissues describedabove.

Alternatively, as described in more detail below, genes encoding theseproteins or nucleic acids regulating the expression of these proteinsmay be introduced into appropriate host cells to increase or decreasethe expression of the proteins as desired.

EXAMPLE 36 Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Regulation of Reproductive Hormones or CellMovement

The proteins encoded by the extended cDNAs or portions thereof may also be evaluated for their ability to regulate reproductive hormones, such as follicle stimulating hormone. Numerous assays for such activity are familiar to those skilled in the art, including the assays disclosed in the following references, which are incorporated herein by reference: Vale et al., Endocrinology 91:562-572, 1972; Ling et al., Nature 321:779-782, 1986; Vale et al., Nature 321:776-779, 1986; Mason et al., Nature 318:659-663, 1985; Forage et al., Proc. Natl. Acad. Sci. USA 83:3091-3095, 1986; Chapter 6.12 (Measurement of Alpha and Beta Chemokines), Current Protocols in Immunology, J. E. Coligan et al. Eds. Greene Publishing Associates and Wiley-Interscience; Taub et al. J. Clin. Invest. 95:1370-1376, 1995; Lind et al. APMIS 103:140-146, 1995; Muller et al. Eur. J. Immunol. 25:1744-1748; Gruber et al. J. of Immunol. 152:5860-5867, 1994; Johnston et al. J. of Immunol. 153:1762-1768, 1994.

Those proteins which exhibit activity as reproductive hormones or regulators of cell movement may then be formulated as pharmaceuticals and used to treat clinical conditions in which regulation of reproductive hormones or cell movement is beneficial. For example, a protein of the present invention may also exhibit activin- or inhibin-related activities. Inhibins are characterized by their ability to inhibit the release of follicle stimulating hormone (FSH), while activins are characterized by their ability to stimulate the release of FSH. Thus, a protein of the present invention, alone or in heterodimers with a member of the inhibin α family, may be useful as a contraceptive based on the ability of inhibins to decrease fertility in female mammals and decrease spermatogenesis in male mammals. Administration of sufficient amounts of other inhibins can induce infertility in these mammals. Alternatively, the protein of the invention, as a homodimer or as a heterodimer with other protein subunits of the inhibin-B group, may be useful as a fertility inducing therapeutic, based upon the ability of activin molecules to stimulate FSH release from cells of the anterior pituitary. See, for example, U.S. Pat. No. 4,798,885, the disclosure of which is incorporated herein by reference. A protein of the invention may also be useful for advancement of the onset of fertility in sexually immature mammals, so as to increase the lifetime reproductive performance of domestic animals such as cows, sheep and pigs.

Alternatively, as described in more detail below, genes encoding theseproteins or nucleic acids regulating the expression of these proteinsmay be introduced into appropriate host cells to increase or decreasethe expression of the proteins as desired.

EXAMPLE 36A Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Chemotactic/Chemokinetic Activity

The proteins encoded by the extended cDNAs or portions thereof may also be evaluated for chemotactic/chemokinetic activity. For example, a protein of the present invention may have chemotactic or chemokinetic activity (e.g., act as a chemokine) for mammalian cells, including, for example, monocytes, fibroblasts, neutrophils, T-cells, mast cells, eosinophils, epithelial and/or endothelial cells. Chemotactic and chemokinetic proteins can be used to mobilize or attract a desired cell population to a desired site of action. Chemotactic or chemokinetic proteins provide particular advantages in treatment of wounds and other trauma to tissues, as well as in treatment of localized infections. For example, attraction of lymphocytes, monocytes or neutrophils to tumors or sites of infection may result in improved immune responses against the tumor or infecting agent.

A protein or peptide has chemotactic activity for a particular cellpopulation if it can stimulate, directly or indirectly, the directedorientation or movement of such cell population. Preferably, the proteinor peptide has the ability to directly stimulate directed movement ofcells. Whether a particular protein has chemotactic activity for apopulation of cells can be readily determined by employing such proteinor peptide in any known assay for cell chemotaxis.

The activity of a protein of the invention may, among other means, bemeasured by the following methods:

Assays for chemotactic activity (which will identify proteins that induce or prevent chemotaxis) consist of assays that measure the ability of a protein to induce the migration of cells across a membrane as well as the ability of a protein to induce the adhesion of one cell population to another cell population. Suitable assays for movement and adhesion include, without limitation, those described in: Current Protocols in Immunology, Ed by J. E. Coligan, A. M. Kruisbeek, D. H. Margulies, E. M. Shevach, W. Strober, Pub. Greene Publishing Associates and Wiley-Interscience (Chapter 6.12, Measurement of alpha and beta Chemokines 6.12.1-6.12.28); Taub et al. J. Clin. Invest. 95:1370-1376, 1995; Lind et al. APMIS 103:140-146, 1995; Mueller et al. Eur. J. Immunol. 25:1744-1748; Gruber et al. J. of Immunol. 152:5860-5867, 1994; Johnston et al. J. of Immunol. 153:1762-1768, 1994.

EXAMPLE 37 Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Regulation of Blood Clotting

The proteins encoded by the extended cDNAs or portions thereof may alsobe evaluated for their effects on blood clotting. Numerous assays forsuch activity are familiar to those skilled in the art, including theassays disclosed in the following references, which are incorporatedherein by reference: Linet et al., J. Clin. Pharmacol. 26:131-140, 1986;Burdick et al., Thrombosis Res. 45:413-419, 1987; Humphrey et al.,Fibrinolysis 5:71-79 (1991); Schaub, Prostaglandins 35:467-474, 1988.

Those proteins which are involved in the regulation of blood clotting may then be formulated as pharmaceuticals and used to treat clinical conditions in which regulation of blood clotting is beneficial. For example, a protein of the invention may also exhibit hemostatic or thrombolytic activity. As a result, such a protein is expected to be useful in treatment of various coagulation disorders (including hereditary disorders, such as hemophilias) or to enhance coagulation and other hemostatic events in treating wounds resulting from trauma, surgery or other causes. A protein of the invention may also be useful for dissolving or inhibiting formation of thromboses and for treatment and prevention of conditions resulting therefrom (such as, for example, infarction of cardiac and central nervous system vessels (e.g., stroke)). Alternatively, as described in more detail below, genes encoding these proteins or nucleic acids regulating the expression of these proteins may be introduced into appropriate host cells to increase or decrease the expression of the proteins as desired.

EXAMPLE 38 Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Involvement in Receptor/Ligand Interactions

The proteins encoded by the extended cDNAs or a portion thereof may alsobe evaluated for their involvement in receptor/ligand interactions.Numerous assays for such involvement are familiar to those skilled inthe art, including the assays disclosed in the following references,which are incorporated herein by reference: Chapter 7.28 (Measurement ofCellular Adhesion under Static Conditions 7.28.1-7.28.22) in CurrentProtocols in Immunology, J. E. Coligan et al. Eds. Greene PublishingAssociates and Wiley-Interscience; Takai et al., Proc. Natl. Acad. Sci.USA 84:6864-6868, 1987; Bierer et al., J. Exp. Med. 168:1145-1156, 1988;Rosenstein et al., J. Exp. Med. 169:149-160, 1989; Stoltenborg et al.,J. Immunol. Methods 175:59-68, 1994; Stitt et al., Cell 80:661-670,1995; Gyuris et al., Cell 75:791-803, 1993.

For example, the proteins of the present invention may also demonstrate activity as receptors, receptor ligands or inhibitors or agonists of receptor/ligand interactions. Examples of such receptors and ligands include, without limitation, cytokine receptors and their ligands, receptor kinases and their ligands, receptor phosphatases and their ligands, receptors involved in cell-cell interactions and their ligands (including without limitation, cellular adhesion molecules (such as selectins, integrins and their ligands) and receptor/ligand pairs involved in antigen presentation, antigen recognition and development of cellular and humoral immune responses). Receptors and ligands are also useful for screening of potential peptide or small molecule inhibitors of the relevant receptor/ligand interaction. A protein of the present invention (including, without limitation, fragments of receptors and ligands) may itself be useful as an inhibitor of receptor/ligand interactions.

EXAMPLE 38A Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Anti-Inflammatory Activity

The proteins encoded by the extended cDNAs or a portion thereof may also be evaluated for anti-inflammatory activity. The anti-inflammatory activity may be achieved by providing a stimulus to cells involved in the inflammatory response, by inhibiting or promoting cell-cell interactions (such as, for example, cell adhesion), by inhibiting or promoting chemotaxis of cells involved in the inflammatory process, inhibiting or promoting cell extravasation, or by stimulating or suppressing production of other factors which more directly inhibit or promote an inflammatory response. Proteins exhibiting such activities can be used to treat inflammatory conditions (including chronic or acute conditions), including without limitation inflammation associated with infection (such as septic shock, sepsis or systemic inflammatory response syndrome (SIRS)), ischemia-reperfusion injury, endotoxin lethality, arthritis, complement-mediated hyperacute rejection, nephritis, cytokine or chemokine-induced lung injury, inflammatory bowel disease, Crohn's disease or resulting from over production of cytokines such as TNF or IL-1. Proteins of the invention may also be useful to treat anaphylaxis and hypersensitivity to an antigenic substance or material.

EXAMPLE 38B Assaying the Proteins Expressed from Extended cDNAs orPortions Thereof for Tumor Inhibition Activity

The proteins encoded by the extended cDNAs or a portion thereof may also be evaluated for tumor inhibition activity. In addition to the activities described above for immunological treatment or prevention of tumors, a protein of the invention may exhibit other anti-tumor activities. A protein may inhibit tumor growth directly or indirectly (such as, for example, via ADCC). A protein may exhibit its tumor inhibitory activity by acting on tumor tissue or tumor precursor tissue, by inhibiting formation of tissues necessary to support tumor growth (such as, for example, by inhibiting angiogenesis), by causing production of other factors, agents or cell types which inhibit tumor growth, or by suppressing, eliminating or inhibiting factors, agents or cell types which promote tumor growth.

A protein of the invention may also exhibit one or more of the following additional activities or effects: inhibiting the growth, infection or function of, or killing, infectious agents, including, without limitation, bacteria, viruses, fungi and other parasites; affecting (suppressing or enhancing) bodily characteristics, including, without limitation, height, weight, hair color, eye color, skin, fat to lean ratio or other tissue pigmentation, or organ or body part size or shape (such as, for example, breast augmentation or diminution, change in bone form or shape); affecting biorhythms or circadian cycles or rhythms; affecting the fertility of male or female subjects; affecting the metabolism, catabolism, anabolism, processing, utilization, storage or elimination of dietary fat, lipid, protein, carbohydrate, vitamins, minerals, cofactors or other nutritional factors or component(s); affecting behavioral characteristics, including, without limitation, appetite, libido, stress, cognition (including cognitive disorders), depression (including depressive disorders) and violent behaviors; providing analgesic effects or other pain reducing effects; promoting differentiation and growth of embryonic stem cells in lineages other than hematopoietic lineages; hormonal or endocrine activity; in the case of enzymes, correcting deficiencies of the enzyme and treating deficiency-related diseases; treatment of hyperproliferative disorders (such as, for example, psoriasis); immunoglobulin-like activity (such as, for example, the ability to bind antigens or complement); and the ability to act as an antigen in a vaccine composition to raise an immune response against such protein or another material or entity which is cross-reactive with such protein.

EXAMPLE 39 Identification of Proteins which Interact with PolypeptidesEncoded by Extended cDNAs

Proteins which interact with the polypeptides encoded by extended cDNAsor portions thereof, such as receptor proteins, may be identified usingtwo hybrid systems such as the MATCHMAKER TWO HYBRID SYSTEM 2 (CatalogNo. K1604-1, Clontech). As described in the manual accompanying theMATCHMAKER TWO HYBRID SYSTEM 2 (Catalog No. K1604-1, Clontech), which isincorporated herein by reference, the extended cDNAs or portionsthereof, are inserted into an expression vector such that they are inframe with DNA encoding the DNA binding domain of the yeasttranscriptional activator GAL4. cDNAs in a cDNA library which encodeproteins which might interact with the polypeptides encoded by theextended cDNAs or portions thereof are inserted into a second expressionvector such that they are in frame with DNA encoding the activationdomain of GAL4. The two expression plasmids are transformed into yeastand the yeast are plated on selection medium which selects forexpression of selectable markers on each of the expression vectors aswell as GAL4 dependent expression of the HIS3 gene. Transformantscapable of growing on medium lacking histidine are screened for GAL4dependent lacZ expression. Those cells which are positive in both thehistidine selection and the lacZ assay contain plasmids encodingproteins which interact with the polypeptide encoded by the extendedcDNAs or portions thereof.

Alternatively, the system described in Lustig et al., Methods in Enzymology 283: 83-99 (1997), the disclosure of which is incorporated herein by reference, may be used for identifying molecules which interact with the polypeptides encoded by extended cDNAs. In such systems, in vitro transcription reactions are performed on a pool of vectors containing extended cDNA inserts cloned downstream of a promoter which drives in vitro transcription. The resulting pools of mRNAs are introduced into Xenopus laevis oocytes. The oocytes are then assayed for a desired activity.

Alternatively, the pooled in vitro transcription products produced asdescribed above may be translated in vitro. The pooled in vitrotranslation products can be assayed for a desired activity or forinteraction with a known polypeptide.

Proteins or other molecules interacting with polypeptides encoded by extended cDNAs can be found by a variety of additional techniques. In one method, affinity columns containing the polypeptide encoded by the extended cDNA or a portion thereof can be constructed. In some versions of this method, the affinity column contains chimeric proteins in which the protein encoded by the extended cDNA or a portion thereof is fused to glutathione S-transferase. A mixture of cellular proteins or pool of expressed proteins as described above is applied to the affinity column. Proteins interacting with the polypeptide attached to the column can then be isolated and analyzed on 2-D electrophoresis gel as described in Rasmussen et al. Electrophoresis, 18, 588-598 (1997), the disclosure of which is incorporated herein by reference. Alternatively, the proteins retained on the affinity column can be purified by electrophoresis based methods and sequenced. The same method can be used to isolate antibodies, to screen phage display products, or to screen phage display human antibodies.

Proteins interacting with polypeptides encoded by extended cDNAs or portions thereof can also be screened by using an Optical Biosensor as described in Edwards & Leatherbarrow, Analytical Biochemistry, 246, 1-6 (1997), the disclosure of which is incorporated herein by reference. The main advantage of the method is that it allows the determination of the association rate between the protein and other interacting molecules. Thus, it is possible to specifically select interacting molecules with a high or low association rate. Typically a target molecule is linked to the sensor surface (through a carboxymethyl dextran matrix) and a sample of test molecules is placed in contact with the target molecules. The binding of a test molecule to the target molecule causes a change in the refractive index and/or thickness. This change is detected by the Biosensor provided it occurs in the evanescent field (which extends a few hundred nanometers from the sensor surface). In these screening assays, the target molecule can be one of the polypeptides encoded by extended cDNAs or a portion thereof and the test sample can be a collection of proteins extracted from tissues or cells, a pool of expressed proteins, combinatorial peptide and/or chemical libraries, or phage displayed peptides. The tissues or cells from which the test proteins are extracted can originate from any species.

In other methods, a target protein is immobilized and the testpopulation is a collection of unique polypeptides encoded by theextended cDNAs or portions thereof.

To study the interaction of the proteins encoded by the extended cDNAs or portions thereof with drugs, the microdialysis coupled to HPLC method described by Wang et al., Chromatographia, 44, 205-208 (1997) or the affinity capillary electrophoresis method described by Busch et al., J. Chromatogr. 777:311-328 (1997), the disclosures of which are incorporated herein by reference, can be used.

The system described in U.S. Pat. No. 5,654,150, the disclosure of whichis incorporated herein by reference, may also be used to identifymolecules which interact with the polypeptides encoded by the extendedcDNAs. In this system, pools of extended cDNAs are transcribed andtranslated in vitro and the reaction products are assayed forinteraction with a known polypeptide or antibody.

It will be appreciated by those skilled in the art that the proteinsexpressed from the extended cDNAs or portions may be assayed fornumerous activities in addition to those specifically enumerated above.For example, the expressed proteins may be evaluated for applicationsinvolving control and regulation of inflammation, tumor proliferation ormetastasis, infection, or other clinical conditions. In addition, theproteins expressed from the extended cDNAs or portions thereof may beuseful as nutritional agents or cosmetic agents.

The proteins expressed from the extended cDNAs or portions thereof maybe used to generate antibodies capable of specifically binding to theexpressed protein or fragments thereof as described in Example 40 below.The antibodies may be capable of binding a full length protein encodedby one of the sequences of SEQ ID NOs: 40-59, 61-73, 75, 77-82, and130-154, a mature protein encoded by one of the sequences of SEQ ID NOs.40-59, 61-75, 77-82, and 130-154, or a signal peptide encoded by one ofthe sequences of SEQ ID Nos. 40-59, 61-73, 75-82, 84 and 130-154.Alternatively, the antibodies may be capable of binding fragments of theproteins expressed from the extended cDNAs which comprise at least 10amino acids of the sequences of SEQ ID NOs: 85-129 and 155-179. In someembodiments, the antibodies may be capable of binding fragments of theproteins expressed from the extended cDNAs which comprise at least 15amino acids of the sequences of SEQ ID NOs: 85-129 and 155-179. In otherembodiments, the antibodies may be capable of binding fragments of theproteins expressed from the extended cDNAs which comprise at least 25amino acids of the sequences of SEQ ID NOs: 85-129 and 155-179. Infurther embodiments, the antibodies may be capable of binding fragmentsof the proteins expressed from the extended cDNAs which comprise atleast 40 amino acids of the sequences of SEQ ID NOs: 85-129 and 155-179.

EXAMPLE 40 Production of an Antibody to a Human Protein

Substantially pure protein or polypeptide is isolated from thetransfected or transformed cells as described in Example 30. Theconcentration of protein in the final preparation is adjusted, forexample, by concentration on an Amicon filter device, to the level of afew micrograms/ml. Monoclonal or polyclonal antibody to the protein canthen be prepared as follows:

A. Monoclonal Antibody Production by Hybridoma Fusion

Monoclonal antibody to epitopes of any of the peptides identified andisolated as described can be prepared from murine hybridomas accordingto the classical method of Kohler, G. and Milstein, C., Nature 256:495(1975) or derivative methods thereof. Briefly, a mouse is repetitivelyinoculated with a few micrograms of the selected protein or peptidesderived therefrom over a period of a few weeks. The mouse is thensacrificed, and the antibody producing cells of the spleen isolated. Thespleen cells are fused by means of polyethylene glycol with mousemyeloma cells, and the excess unfused cells destroyed by growth of thesystem on selective media comprising aminopterin (HAT media). Thesuccessfully fused cells are diluted and aliquots of the dilution placedin wells of a microtiter plate where growth of the culture is continued.Antibody-producing clones are identified by detection of antibody in thesupernatant fluid of the wells by immunoassay procedures, such as Elisa,as originally described by Engvall, E., Meth. Enzymol. 70:419 (1980),and derivative methods thereof. Selected positive clones can be expandedand their monoclonal antibody product harvested for use. Detailedprocedures for monoclonal antibody production are described in Davis, L.et al. Basic Methods in Molecular Biology Elsevier, New York. Section21-2.

B. Polyclonal Antibody Production by Immunization

Polyclonal antiserum containing antibodies to heterogeneous epitopes of a single protein can be prepared by immunizing suitable animals with the expressed protein or peptides derived therefrom described above, which can be unmodified or modified to enhance immunogenicity. Effective polyclonal antibody production is affected by many factors related both to the antigen and the host species. For example, small molecules tend to be less immunogenic than others and may require the use of carriers and adjuvant. Also, host animals vary in response to site of inoculations and dose, with both inadequate and excessive doses of antigen resulting in low titer antisera. Small doses (ng level) of antigen administered at multiple intradermal sites appear to be most reliable. An effective immunization protocol for rabbits can be found in Vaitukaitis, J. et al. J. Clin. Endocrinol. Metab. 33:988-991 (1971).

Booster injections can be given at regular intervals, and antiserumharvested when antibody titer thereof, as determinedsemi-quantitatively, for example, by double immunodiffusion in agaragainst known concentrations of the antigen, begins to fall. See, forexample, Ouchterlony, O. et al., Chap. 19 in: Handbook of ExperimentalImmunology D. Wier (ed) Blackwell (1973). Plateau concentration ofantibody is usually in the range of 0.1 to 0.2 mg/ml of serum (about 12μM). Affinity of the antisera for the antigen is determined by preparingcompetitive binding curves, as described, for example, by Fisher, D.,Chap. 42 in: Manual of Clinical Immunology, 2d Ed. (Rose and Friedman,Eds.) Amer. Soc. For Microbiol., Washington, D.C. (1980).

Antibody preparations prepared according to either protocol are usefulin quantitative immunoassays which determine concentrations ofantigen-bearing substances in biological samples; they are also usedsemi-quantitatively or qualitatively to identify the presence of antigenin a biological sample. The antibodies may also be used in therapeuticcompositions for killing cells expressing the protein or reducing thelevels of the protein in the body.

V. Use of Extended cDNAs or Portions Thereof as Reagents

The extended cDNAs of the present invention may be used as reagents inisolation procedures, diagnostic assays, and forensic procedures. Forexample, sequences from the extended cDNAs (or genomic DNAs obtainabletherefrom) may be detectably labeled and used as probes to isolate othersequences capable of hybridizing to them. In addition, sequences fromthe extended cDNAs (or genomic DNAs obtainable therefrom) may be used todesign PCR primers to be used in isolation, diagnostic, or forensicprocedures.

EXAMPLE 41 Preparation of PCR Primers and Amplification of DNA

The extended cDNAs (or genomic DNAs obtainable therefrom) may be used to prepare PCR primers for a variety of applications, including isolation procedures for cloning nucleic acids capable of hybridizing to such sequences, diagnostic techniques and forensic techniques. The PCR primers are at least 10 bases, and preferably at least 12, 15, or 17 bases in length. More preferably, the PCR primers are at least 20-30 bases in length. In some embodiments, the PCR primers may be more than 30 bases in length. It is preferred that the primer pairs have approximately the same G/C ratio, so that melting temperatures are approximately the same. A variety of PCR techniques are familiar to those skilled in the art. For a review of PCR technology, see Molecular Cloning to Genetic Engineering White, B. A. Ed. in Methods in Molecular Biology 67: Humana Press, Totowa 1997. In each of these PCR procedures, PCR primers on either side of the nucleic acid sequences to be amplified are added to a suitably prepared nucleic acid sample along with dNTPs and a thermostable polymerase such as Taq polymerase, Pfu polymerase, or Vent polymerase. The nucleic acid in the sample is denatured and the PCR primers are specifically hybridized to complementary nucleic acid sequences in the sample. The hybridized primers are extended. Thereafter, another cycle of denaturation, hybridization, and extension is initiated. The cycles are repeated multiple times to produce an amplified fragment containing the nucleic acid sequence between the primer sites.
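The preference for primer pairs with similar G/C ratios and melting temperatures can be illustrated with a minimal sketch. The function names, the 4 degree tolerance, and the use of the Wallace rule of thumb (2 °C per A/T, 4 °C per G/C, a rough estimate for short oligonucleotides) are assumptions made for illustration; they are not taken from the present disclosure.

```python
# Illustrative sketch only: names, thresholds and the Wallace rule are
# assumptions, not part of the disclosure.

def gc_fraction(primer: str) -> float:
    """Return the fraction of G and C bases in a primer."""
    seq = primer.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(primer: str) -> int:
    """Estimate melting temperature (deg C) as 2*(A+T) + 4*(G+C)."""
    seq = primer.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def primers_compatible(fwd: str, rev: str, min_len: int = 10,
                       max_tm_diff: int = 4) -> bool:
    """Check the minimum length of 10 bases and a similar estimated Tm."""
    if len(fwd) < min_len or len(rev) < min_len:
        return False
    return abs(wallace_tm(fwd) - wallace_tm(rev)) <= max_tm_diff


if __name__ == "__main__":
    forward = "AAAAATTTTTGGGGGCCCCC"   # hypothetical 20-base primer
    reverse = "GGGGGCCCCCAAAAATTTTT"   # hypothetical 20-base primer
    print(gc_fraction(forward), wallace_tm(forward))
    print(primers_compatible(forward, reverse))
```

In practice, more accurate melting temperature estimates (for example nearest-neighbor models) would likely be preferred for primers of 20-30 bases; the simple rule is used here only to make the G/C reasoning concrete.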

EXAMPLE 42 Use of Extended cDNAs as Probes

Probes derived from extended cDNAs or portions thereof (or genomic DNAsobtainable therefrom) may be labeled with detectable labels familiar tothose skilled in the art, including radioisotopes and non-radioactivelabels, to provide a detectable probe. The detectable probe may besingle stranded or double stranded and may be made using techniquesknown in the art, including in vitro transcription, nick translation, orkinase reactions. A nucleic acid sample containing a sequence capable ofhybridizing to the labeled probe is contacted with the labeled probe. Ifthe nucleic acid in the sample is double stranded, it may be denaturedprior to contacting the probe. In some applications, the nucleic acidsample may be immobilized on a surface such as a nitrocellulose or nylonmembrane. The nucleic acid sample may comprise nucleic acids obtainedfrom a variety of sources, including genomic DNA, cDNA libraries, RNA,or tissue samples.

Procedures used to detect the presence of nucleic acids capable ofhybridizing to the detectable probe include well known techniques suchas Southern blotting, Northern blotting, dot blotting, colonyhybridization, and plaque hybridization. In some applications, thenucleic acid capable of hybridizing to the labeled probe may be clonedinto vectors such as expression vectors, sequencing vectors, or in vitrotranscription vectors to facilitate the characterization and expressionof the hybridizing nucleic acids in the sample. For example, suchtechniques may be used to isolate and clone sequences in a genomiclibrary or cDNA library which are capable of hybridizing to thedetectable probe as described in Example 30 above.

PCR primers made as described in Example 41 above may be used inforensic analyses, such as the DNA fingerprinting techniques describedin Examples 43-47 below. Such analyses may utilize detectable probes orprimers based on the sequences of the extended cDNAs isolated using the5′ ESTs (or genomic DNAs obtainable therefrom).

EXAMPLE 43 Forensic Matching by DNA Sequencing

In one exemplary method, DNA samples are isolated from forensic specimens of, for example, hair, semen, blood or skin cells by conventional methods. A panel of PCR primers based on a number of the extended cDNAs (or genomic DNAs obtainable therefrom) is then utilized in accordance with Example 41 to amplify DNA of approximately 100-200 bases in length from the forensic specimen. Corresponding sequences are obtained from a test subject. Each of these identification DNAs is then sequenced using standard techniques, and a simple database comparison determines the differences, if any, between the sequences from the subject and those from the sample. Statistically significant differences between the suspect's DNA sequences and those from the sample conclusively prove a lack of identity. This lack of identity can be proven, for example, with only one sequence. Identity, on the other hand, should be demonstrated with a large number of sequences, all matching. Preferably, a minimum of 50 statistically identical sequences of 100 bases in length are used to prove identity between the suspect and the sample.
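The asymmetry described above, in which a single differing sequence excludes identity while identity requires many matching sequences, can be sketched as a simple comparison. The function name, the locus labels, and the use of exact string equality in place of a statistical test are assumptions made for illustration only.

```python
# Illustrative sketch only: exact string equality stands in for the
# "statistically significant difference" test, which is not specified here.

def compare_panels(subject: dict, specimen: dict, min_loci: int = 50):
    """Compare sequenced amplicons from a subject and a forensic specimen.

    Both arguments map a (hypothetical) locus label to its sequence.
    Returns a (verdict, mismatched_loci) tuple.
    """
    shared = sorted(set(subject) & set(specimen))
    mismatched = [locus for locus in shared
                  if subject[locus].upper() != specimen[locus].upper()]
    if mismatched:
        # A single differing sequence is enough to exclude identity.
        return "excluded", mismatched
    if len(shared) >= min_loci:
        # Identity is asserted only when a large panel matches throughout.
        return "match", []
    return "inconclusive", []


if __name__ == "__main__":
    subject = {"locus_1": "ACGTACGT", "locus_2": "GGCCTTAA"}
    specimen = {"locus_1": "ACGTACGT", "locus_2": "GGCCTTAG"}
    print(compare_panels(subject, specimen))
```

The default of 50 loci follows the preferred minimum stated above; a real comparison would also need to account for sequencing error and heterozygosity, which this sketch ignores.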

EXAMPLE 44 Positive Identification by DNA Sequencing

The technique outlined in the previous example may also be used on alarger scale to provide a unique fingerprint-type identification of anyindividual. In this technique, primers are prepared from a large numberof sequences from Table IV and the appended sequence listing.Preferably, 20 to 50 different primers are used. These primers are usedto obtain a corresponding number of PCR-generated DNA segments from theindividual in question in accordance with Example 41. Each of these DNAsegments is sequenced, using the methods set forth in Example 43. Thedatabase of sequences generated through this procedure uniquelyidentifies the individual from whom the sequences were obtained. Thesame panel of primers may then be used at any later time to absolutelycorrelate tissue or other biological specimen with that individual.

EXAMPLE 45 Southern Blot Forensic Identification

The procedure of Example 44 is repeated to obtain a panel of at least 10amplified sequences from an individual and a specimen. Preferably, thepanel contains at least 50 amplified sequences. More preferably, thepanel contains 100 amplified sequences. In some embodiments, the panelcontains 200 amplified sequences. This PCR-generated DNA is thendigested with one or a combination of, preferably, four base specificrestriction enzymes. Such enzymes are commercially available and knownto those of skill in the art. After digestion, the resultant genefragments are size separated in multiple duplicate wells on an agarosegel and transferred to nitrocellulose using Southern blotting techniqueswell known to those with skill in the art. For a review of Southernblotting see Davis et al. (Basic Methods in Molecular Biology, 1986,Elsevier Press. pp 62-65).

A panel of probes based on the sequences of the extended cDNAs (or genomic DNAs obtainable therefrom), or fragments thereof of at least 10 bases, are radioactively or colorimetrically labeled using methods known in the art, such as nick translation or end labeling, and hybridized to the Southern blot using techniques known in the art (Davis et al., supra). Preferably, the probe comprises at least 12, 15, or 17 consecutive nucleotides from the extended cDNA (or genomic DNAs obtainable therefrom). More preferably, the probe comprises at least 20-30 consecutive nucleotides from the extended cDNA (or genomic DNAs obtainable therefrom). In some embodiments, the probe comprises more than 30 nucleotides from the extended cDNA (or genomic DNAs obtainable therefrom). In other embodiments, the probe comprises at least 40, at least 50, at least 75, at least 100, at least 150, or at least 200 consecutive nucleotides from the extended cDNA (or genomic DNAs obtainable therefrom).

Preferably, at least 5 to 10 of these labeled probes are used, and morepreferably at least about 20 or 30 are used to provide a unique pattern.The resultant bands appearing from the hybridization of a large sampleof extended cDNAs (or genomic DNAs obtainable therefrom) will be aunique identifier. Since the restriction enzyme cleavage will bedifferent for every individual, the band pattern on the Southern blotwill also be unique. Increasing the number of extended cDNA probes willprovide a statistically higher level of confidence in the identificationsince there will be an increased number of sets of bands used foridentification.

EXAMPLE 46 Dot Blot Identification Procedure

Another technique for identifying individuals using the extended cDNA sequences disclosed herein utilizes a dot blot hybridization technique.

Genomic DNA is isolated from nuclei of the subject to be identified. Oligonucleotide probes of approximately 30 bp in length are synthesized that correspond to at least 10, preferably 50, sequences from the extended cDNAs or genomic DNAs obtainable therefrom. The probes are used to hybridize to the genomic DNA under conditions known to those in the art. The oligonucleotides are end labeled with ³²P using polynucleotide kinase (Pharmacia). Dot blots are created by spotting the genomic DNA onto nitrocellulose or the like using a vacuum dot blot manifold (BioRad, Richmond Calif.). The genomic DNA is baked or UV linked to the nitrocellulose filter, and the filter is prehybridized and hybridized with labeled probe using techniques known in the art (Davis et al. supra). The ³²P labeled DNA fragments are sequentially hybridized with successively stringent conditions to detect minimal differences between the 30 bp sequence and the DNA. Tetramethylammonium chloride is useful for identifying clones containing small numbers of nucleotide mismatches (Wood et al., Proc. Natl. Acad. Sci. USA 82(6):1585-1588 (1985), which is hereby incorporated by reference). A unique pattern of dots distinguishes one individual from another individual.

Extended cDNAs or oligonucleotides containing at least 10 consecutivebases from these sequences can be used as probes in the followingalternative fingerprinting technique. Preferably, the probe comprises atleast 12, 15, or 17 consecutive nucleotides from the extended cDNA (orgenomic DNAs obtainable therefrom). More preferably, the probe comprisesat least 20-30 consecutive nucleotides from the extended cDNA (orgenomic DNAs obtainable therefrom). In some embodiments, the probecomprises more than 30 nucleotides from the extended cDNA (or genomicDNAs obtainable therefrom). In other embodiments, the probe comprises atleast 40, at least 50, at least 75, at least 100, at least 150, or atleast 200 consecutive nucleotides from the extended cDNA (or genomicDNAs obtainable therefrom).

Preferably, a plurality of probes having sequences from different genes are used in the alternative fingerprinting technique. Example 47 below provides a representative alternative fingerprinting procedure in which the probes are derived from extended cDNAs.

EXAMPLE 47 Alternative “Fingerprint” Identification Technique

20-mer oligonucleotides are prepared from a large number, e.g. 50, 100, or 200, of extended cDNA sequences (or genomic DNAs obtainable therefrom) using commercially available oligonucleotide services such as Genset, Paris, France. Cell samples from the test subject are processed for DNA using techniques well known to those with skill in the art. The nucleic acid is digested with restriction enzymes such as EcoRI and XbaI. Following digestion, samples are applied to wells for electrophoresis. The procedure, as known in the art, may be modified to accommodate polyacrylamide electrophoresis; however, in this example, samples containing 5 μg of DNA are loaded into wells and separated on 0.8% agarose gels. The gels are transferred onto nitrocellulose using standard Southern blotting techniques.
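As an illustration of the digestion step, the following sketch computes the fragment lengths expected when a linear DNA sequence is cut with EcoRI and XbaI. The recognition sites (GAATTC and TCTAGA) and the cut position one base into each site are well established; the function name and the input sequence are hypothetical, and the sketch considers one strand of a linear molecule only.

```python
import re

# Recognition site and cut offset (bases from the start of the site)
# for the two enzymes named in the example: G^AATTC and T^CTAGA.
SITES = {"EcoRI": ("GAATTC", 1), "XbaI": ("TCTAGA", 1)}


def digest(seq, enzymes=("EcoRI", "XbaI")):
    """Return fragment lengths, in 5' to 3' order, for a linear sequence."""
    seq = seq.upper()
    cuts = set()
    for name in enzymes:
        site, offset = SITES[name]
        # A lookahead pattern finds every occurrence of the site.
        for match in re.finditer("(?=" + site + ")", seq):
            cuts.add(match.start() + offset)
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [end - start for start, end in zip(bounds, bounds[1:])]
```

Individuals whose DNA differs at or between such sites yield different fragment lengths, which is what produces the individual-specific band pattern after electrophoresis and hybridization.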

10 ng of each of the oligonucleotides are pooled and end-labeled with ³²P. The nitrocellulose is prehybridized with blocking solution and hybridized with the labeled probes. Following hybridization and washing, the nitrocellulose filter is exposed to X-Omat AR X-ray film. The resulting hybridization pattern will be unique for each individual.

It is additionally contemplated within this example that the number of probe sequences used can be varied for additional accuracy or clarity.

The antibodies generated in Examples 30 and 40 above may be used to identify the tissue type or cell species from which a sample is derived as described above.

EXAMPLE 48 Identification of Tissue Types or Cell Species by Means of Labeled Tissue Specific Antibodies

Identification of specific tissues is accomplished by the visualizationof tissue specific antigens by means of antibody preparations accordingto Examples 30 and 40 which are conjugated, directly or indirectly to adetectable marker. Selected labeled antibody species bind to theirspecific antigen binding partner in tissue sections, cell suspensions,or in extracts of soluble proteins from a tissue sample to provide apattern for qualitative or semi-qualitative interpretation.

Antisera for these procedures must have a potency exceeding that of the native preparation, and for that reason, antibodies are concentrated to a mg/ml level by isolation of the gamma globulin fraction, for example, by ion-exchange chromatography or by ammonium sulfate fractionation. Also, to provide the most specific antisera, unwanted antibodies, for example to common proteins, must be removed from the gamma globulin fraction, for example by means of insoluble immunoabsorbents, before the antibodies are labeled with the marker. Either monoclonal antibodies or heterologous antisera are suitable for either procedure.

A. Immunohistochemical Techniques

Purified, high-titer antibodies, prepared as described above, are conjugated to a detectable marker, as described, for example, by Fudenberg, H., Chap. 26 in: Basic & Clinical Immunology, 3rd Ed. Lange, Los Altos, Calif. (1980) or Rose, N. et al., Chap. 12 in: Methods in Immunodiagnosis, 2d Ed. John Wiley & Sons, New York (1980).

A fluorescent marker, either fluorescein or rhodamine, is preferred, butantibodies can also be labeled with an enzyme that supports a colorproducing reaction with a substrate, such as horseradish peroxidase.Markers can be added to tissue-bound antibody in a second step, asdescribed below. Alternatively, the specific antitissue antibodies canbe labeled with ferritin or other electron dense particles, andlocalization of the ferritin coupled antigen-antibody complexes achievedby means of an electron microscope. In yet another approach, theantibodies are radiolabeled, with, for example ¹²⁵I, and detected byoverlaying the antibody treated preparation with photographic emulsion.

Preparations to carry out the procedures can comprise monoclonal orpolyclonal antibodies to a single protein or peptide identified asspecific to a tissue type, for example, brain tissue, or antibodypreparations to several antigenically distinct tissue specific antigenscan be used in panels, independently or in mixtures, as required.

Tissue sections and cell suspensions are prepared forimmunohistochemical examination according to common histologicaltechniques. Multiple cryostat sections (about 4 μm, unfixed) of theunknown tissue and known control, are mounted and each slide coveredwith different dilutions of the antibody preparation. Sections of knownand unknown tissues should also be treated with preparations to providea positive control, a negative control, for example, pre-immune sera,and a control for non-specific staining, for example, buffer.

Treated sections are incubated in a humid chamber for 30 min at roomtemperature, rinsed, then washed in buffer for 30-45 min. Excess fluidis blotted away, and the marker developed.

If the tissue specific antibody was not labeled in the first incubation,it can be labeled at this time in a second antibody-antibody reaction,for example, by adding fluorescein- or enzyme-conjugated antibodyagainst the immunoglobulin class of the antiserum-producing species, forexample, fluorescein labeled antibody to mouse IgG. Such labeled seraare commercially available.

The antigen found in the tissues by the above procedure can bequantified by measuring the intensity of color or fluorescence on thetissue section, and calibrating that signal using appropriate standards.

B. Identification of Tissue Specific Soluble Proteins

The visualization of tissue specific proteins and identification ofunknown tissues from that procedure is carried out using the labeledantibody reagents and detection strategy as described forimmunohistochemistry; however the sample is prepared according to anelectrophoretic technique to distribute the proteins extracted from thetissue in an orderly array on the basis of molecular weight fordetection.

A tissue sample is homogenized using a Virtis apparatus; cellsuspensions are disrupted by Dounce homogenization or osmotic lysis,using detergents in either case as required to disrupt cell membranes,as is the practice in the art. Insoluble cell components such as nuclei,microsomes, and membrane fragments are removed by ultracentrifugation,and the soluble protein-containing fraction concentrated if necessaryand reserved for analysis.

A sample of the soluble protein solution is resolved into individualprotein species by conventional SDS polyacrylamide electrophoresis asdescribed, for example, by Davis, L. et al., Section 19-2 in: BasicMethods in Molecular Biology (P. Leder, ed), Elsevier, New York (1986),using a range of amounts of polyacrylamide in a set of gels to resolvethe entire molecular weight range of proteins to be detected in thesample. A size marker is run in parallel for purposes of estimatingmolecular weights of the constituent proteins. Sample size for analysisis a convenient volume of from 5 to 55 μl, and containing from about 1to 100 μg protein. An aliquot of each of the resolved proteins istransferred by blotting to a nitrocellulose filter paper, a process thatmaintains the pattern of resolution. Multiple copies are prepared. Theprocedure, known as Western Blot Analysis, is well described in Davis,L. et al., (above) Section 19-3. One set of nitrocellulose blots isstained with Coomassie Blue dye to visualize the entire set of proteinsfor comparison with the antibody bound proteins. The remainingnitrocellulose filters are then incubated with a solution of one or morespecific antisera to tissue specific proteins prepared as described inExamples 30 and 40. In this procedure, as in procedure A above,appropriate positive and negative sample and reagent controls are run.

In either procedure A or B, a detectable label can be attached to theprimary tissue antigen-primary antibody complex according to variousstrategies and permutations thereof. In a straightforward approach, theprimary specific antibody can be labeled; alternatively, the unlabeledcomplex can be bound by a labeled secondary anti-IgG antibody. In otherapproaches, either the primary or secondary antibody is conjugated to abiotin molecule, which can, in a subsequent step, bind an avidinconjugated marker. According to yet another strategy, enzyme labeled orradioactive protein A, which has the property of binding to any IgG, isbound in a final step to either the primary or secondary antibody.

The visualization of tissue specific antigen binding at levels abovethose seen in control tissues to one or more tissue specific antibodies,prepared from the gene sequences identified from extended cDNAsequences, can identify tissues of unknown origin, for example, forensicsamples, or differentiated tumor tissue that has metastasized to foreignbodily sites.

In addition to their applications in forensics and identification,extended cDNAs (or genomic DNAs obtainable therefrom) may be mapped totheir chromosomal locations. Example 49 below describes radiation hybrid(RH) mapping of human chromosomal regions using extended cDNAs. Example50 below describes a representative procedure for mapping an extendedcDNA (or a genomic DNA obtainable therefrom) to its location on a humanchromosome. Example 51 below describes mapping of extended cDNAs (orgenomic DNAs obtainable therefrom) on metaphase chromosomes byFluorescence In Situ Hybridization (FISH).

EXAMPLE 49 Radiation Hybrid Mapping of Extended cDNAs to the Human Genome

Radiation hybrid (RH) mapping is a somatic cell genetic approach that can be used for high resolution mapping of the human genome. In this approach, cell lines containing one or more human chromosomes are lethally irradiated, breaking each chromosome into fragments whose size depends on the radiation dose. These fragments are rescued by fusion with cultured rodent cells, yielding subclones containing different portions of the human genome. This technique is described by Benham et al. (Genomics 4:509-517, 1989) and Cox et al. (Science 250:245-250, 1990), the entire contents of which are hereby incorporated by reference. The random and independent nature of the subclones permits efficient mapping of any human genome marker. Human DNA isolated from a panel of 80-100 cell lines provides a mapping reagent for ordering extended cDNAs (or genomic DNAs obtainable therefrom). In this approach, the frequency of breakage between markers is used to measure distance, allowing construction of fine resolution maps as has been done using conventional ESTs (Schuler et al., Science 274:540-546, 1996, hereby incorporated by reference).
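The use of breakage frequency as a distance measure can be sketched as follows. The code assumes a commonly used two-point estimator, in which the breakage frequency is the number of hybrids retaining only one of two markers divided by the number expected if the markers were unlinked, and the map distance in centiRays is -100 ln(1 - θ). This estimator and the function names are assumptions made for illustration and are not taken from the present disclosure.

```python
import math


def breakage_frequency(marker_a, marker_b):
    """Estimate the breakage frequency (theta) between two markers.

    Each argument lists, per hybrid cell line, 1 if the marker was
    retained (PCR positive) and 0 if it was not.
    """
    n = len(marker_a)
    if n == 0 or n != len(marker_b):
        raise ValueError("marker vectors must be non-empty and equal in length")
    discordant = sum(1 for a, b in zip(marker_a, marker_b) if a != b)
    ret_a = sum(marker_a) / n  # retention frequency of marker A
    ret_b = sum(marker_b) / n  # retention frequency of marker B
    expected = n * (ret_a + ret_b - 2 * ret_a * ret_b)
    if expected == 0:
        raise ValueError("panel is uninformative for this marker pair")
    return min(discordant / expected, 1.0)


def distance_centirays(theta):
    """Convert a breakage frequency into a map distance in centiRays."""
    if theta >= 1.0:
        return math.inf  # markers behave as unlinked
    return -100.0 * math.log(1.0 - theta)
```

Markers that are always retained or lost together give a breakage frequency of zero, while markers that segregate independently approach a frequency of one and cannot be placed relative to each other.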

RH mapping has been used to generate a high-resolution whole genomeradiation hybrid map of human chromosome 17q22-q25.3 across the genesfor growth hormone (GH) and thymidine kinase (TK) (Foster et al.,Genomics 33:185-192, 1996), the region surrounding the Gorlin syndromegene (Obermayr et al., Eur. J. Hum. Genet. 4:242-245, 1996), 60 locicovering the entire short arm of chromosome 12 (Raeymaekers et al.,Genomics 29:170-178, 1995), the region of human chromosome 22 containingthe neurofibromatosis type 2 locus (Frazer et al., Genomics 14:574-584,1992) and 13 loci on the long arm of chromosome 5 (Warrington et al.,Genomics 11:701-708, 1991).

EXAMPLE 50 Mapping of Extended cDNAs to Human Chromosomes using PCR Techniques

Extended cDNAs (or genomic DNAs obtainable therefrom) may be assigned tohuman chromosomes using PCR based methodologies. In such approaches,oligonucleotide primer pairs are designed from the extended cDNAsequence (or the sequence of a genomic DNA obtainable therefrom) tominimize the chance of amplifying through an intron. Preferably, theoligonucleotide primers are 18-23 bp in length and are designed for PCRamplification. The creation of PCR primers from known sequences is wellknown to those with skill in the art. For a review of PCR technology seeErlich, H. A., PCR Technology; Principles and Applications for DNAAmplification, 1992. W.H. Freeman and Co., New York.

The primers are used in polymerase chain reactions (PCR) to amplify templates from total human genomic DNA. PCR conditions are as follows: 60 ng of genomic DNA is used as a template for PCR with 80 ng of each oligonucleotide primer, 0.6 unit of Taq polymerase, and 1 μCi of a ³²P-labeled deoxycytidine triphosphate. The PCR is performed in a microplate thermocycler (Techne) under the following conditions: 30 cycles of 94° C., 1.4 min; 55° C., 2 min; and 72° C., 2 min; with a final extension at 72° C. for 10 min. The amplified products are analyzed on a 6% polyacrylamide sequencing gel and visualized by autoradiography. If the length of the resulting PCR product is identical to the distance between the ends of the primer sequences in the extended cDNA from which the primers are derived, then the PCR reaction is repeated with DNA templates from two panels of human-rodent somatic cell hybrids, BIOS PCRable DNA (BIOS Corporation) and NIGMS Human-Rodent Somatic Cell Hybrid Mapping Panel Number 1 (NIGMS, Camden, N.J.).

PCR is used to screen a series of somatic cell hybrid cell lines containing defined sets of human chromosomes for the presence of a given extended cDNA (or genomic DNA obtainable therefrom). DNA is isolated from the somatic hybrids and used as starting templates for PCR reactions using the primer pairs from the extended cDNAs (or genomic DNAs obtainable therefrom). Only those somatic cell hybrids with chromosomes containing the human gene corresponding to the extended cDNA (or genomic DNA obtainable therefrom) will yield an amplified fragment. The extended cDNAs (or genomic DNAs obtainable therefrom) are assigned to a chromosome by analysis of the segregation pattern of PCR products from the somatic hybrid DNA templates. The single human chromosome present in all cell hybrids that give rise to an amplified fragment is the chromosome containing that extended cDNA (or genomic DNA obtainable therefrom). For a review of techniques and analysis of results from somatic cell gene mapping experiments, see Ledbetter et al., Genomics 6:475-481 (1990).
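The segregation analysis can be sketched as a set intersection. The function name, the hybrid labels, and the data layout are hypothetical. The additional step of discarding chromosomes carried by PCR-negative hybrids is an assumption added here to make the concordance test stricter; the text above relies only on the chromosome shared by all positive hybrids.

```python
def assign_chromosome(hybrid_content, pcr_positive):
    """Return the chromosomes concordant with the PCR results.

    hybrid_content maps a hybrid label to the set of human chromosomes
    it carries; pcr_positive maps the same labels to True when an
    amplified fragment was obtained from that hybrid.
    """
    positives = [h for h, hit in pcr_positive.items() if hit]
    if not positives:
        return []
    # Chromosomes present in every hybrid that gave a product.
    candidates = set.intersection(*(set(hybrid_content[h]) for h in positives))
    # Assumed extra step: drop chromosomes present in a negative hybrid.
    for hybrid, hit in pcr_positive.items():
        if not hit:
            candidates -= set(hybrid_content[hybrid])
    return sorted(candidates)
```

An unambiguous assignment corresponds to a result containing exactly one chromosome; an empty or multi-member result suggests that the panel is insufficient or that the results contain discordant hybrids.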

Alternatively, the extended cDNAs (or genomic DNAs obtainable therefrom) may be mapped to individual chromosomes using FISH as described in Example 51 below.

EXAMPLE 51 Mapping of Extended 5′ ESTs to Chromosomes Using Fluorescence in situ Hybridization

Fluorescence in situ hybridization allows the extended cDNA (or genomicDNA obtainable therefrom) to be mapped to a particular location on agiven chromosome. The chromosomes to be used for fluorescence in situhybridization techniques may be obtained from a variety of sourcesincluding cell cultures, tissues, or whole blood.

In a preferred embodiment, chromosomal localization of an extended cDNA (or genomic DNA obtainable therefrom) is obtained by FISH as described by Cherif et al. (Proc. Natl. Acad. Sci. U.S.A., 87:6639-6643, 1990). Metaphase chromosomes are prepared from phytohemagglutinin (PHA)-stimulated blood cell donors. PHA-stimulated lymphocytes from healthy males are cultured for 72 h in RPMI-1640 medium. For synchronization, methotrexate (10 μM) is added for 17 h, followed by addition of 5-bromodeoxyuridine (5-BudR, 0.1 mM) for 6 h. Colcemid (1 μg/ml) is added for the last 15 min before harvesting the cells. Cells are collected, washed in RPMI, incubated with a hypotonic solution of KCl (75 mM) at 37° C. for 15 min and fixed in three changes of methanol:acetic acid (3:1). The cell suspension is dropped onto a glass slide and air dried. The extended cDNA (or genomic DNA obtainable therefrom) is labeled with biotin-16 dUTP by nick translation according to the manufacturer's instructions (Bethesda Research Laboratories, Bethesda, Md.), purified using a SEPHADEX G-50 column (Pharmacia, Uppsala, Sweden) and precipitated. Just prior to hybridization, the DNA pellet is dissolved in hybridization buffer (50% formamide, 2×SSC, 10% dextran sulfate, 1 mg/ml sonicated salmon sperm DNA, pH 7) and the probe is denatured at 70° C. for 5-10 min.

Slides kept at −20° C. are treated for 1 h at 37° C. with RNase A (100μg/ml), rinsed three times in 2×SSC and dehydrated in an ethanol series.Chromosome preparations are denatured in 70% formamide, 2×SSC for 2 minat 70° C., then dehydrated at 4° C. The slides are treated withproteinase K (10 μg/100 ml in 20 mM Tris-HCl, 2 mM CaCl₂) at 37° C. for8 min and dehydrated. The hybridization mixture containing the probe isplaced on the slide, covered with a coverslip, sealed with rubber cementand incubated overnight in a humid chamber at 37° C. After hybridizationand post-hybridization washes, the biotinylated probe is detected byavidin-FITC and amplified with additional layers of biotinylated goatanti-avidin and avidin-FITC. For chromosomal localization, fluorescentR-bands are obtained as previously described (Cherif et al., supra). Theslides are observed under a LEICA fluorescence microscope (DMRXA).Chromosomes are counterstained with propidium iodide and the fluorescentsignal of the probe appears as two symmetrical yellow-green spots onboth chromatids of the fluorescent R-band chromosome (red). Thus, aparticular extended cDNA (or genomic DNA obtainable therefrom) may belocalized to a particular cytogenetic R-band on a given chromosome.

Once the extended cDNAs (or genomic DNAs obtainable therefrom) have been assigned to particular chromosomes using the techniques described in Examples 49-51 above, they may be utilized to construct a high resolution map of the chromosomes on which they are located or to identify the chromosomes in a sample.

EXAMPLE 52 Use of Extended cDNAs to Construct or Expand Chromosome Maps

Chromosome mapping involves assigning a given unique sequence to a particular chromosome as described above. Once the unique sequence has been mapped to a given chromosome, it is ordered relative to other unique sequences located on the same chromosome. One approach to chromosome mapping utilizes a series of yeast artificial chromosomes (YACs) bearing several thousand long inserts derived from the chromosomes of the organism from which the extended cDNAs (or genomic DNAs obtainable therefrom) are obtained. This approach is described in Ramaiah Nagaraja et al. Genome Research 7:210-222, March 1997. Briefly, in this approach each chromosome is broken into overlapping pieces which are inserted into the YAC vector. The YAC inserts are screened using PCR or other methods to determine whether they include the extended cDNA (or genomic DNA obtainable therefrom) whose position is to be determined. Once an insert has been found which includes the extended cDNA (or genomic DNA obtainable therefrom), the insert can be analyzed by PCR or other methods to determine whether the insert also contains other sequences known to be on the chromosome or in the region from which the extended cDNA (or genomic DNA obtainable therefrom) was derived. This process can be repeated for each insert in the YAC library to determine the location of each of the extended cDNAs (or genomic DNAs obtainable therefrom) relative to one another and to other known chromosomal markers. In this way, a high resolution map of the distribution of numerous unique markers along each of the organism's chromosomes may be obtained.

As described in Example 53 below, extended cDNAs (or genomic DNAs obtainable therefrom) may also be used to identify genes associated with a particular phenotype, such as hereditary disease or drug response.

EXAMPLE 53 Identification of Genes Associated with Hereditary Diseases or Drug Response

This example illustrates an approach useful for the association ofextended cDNAs (or genomic DNAs obtainable therefrom) with particularphenotypic characteristics. In this example, a particular extended cDNA(or genomic DNA obtainable therefrom) is used as a test probe toassociate that extended cDNA (or genomic DNA obtainable therefrom) witha particular phenotypic characteristic.

Extended cDNAs (or genomic DNAs obtainable therefrom) are mapped to a particular location on a human chromosome using techniques such as those described in Examples 49 and 50 or other techniques known in the art. A search of Mendelian Inheritance in Man (V. McKusick, Mendelian Inheritance in Man, available on line through Johns Hopkins University Welch Medical Library) reveals the region of the human chromosome which contains the extended cDNA (or genomic DNA obtainable therefrom) to be a very gene rich region containing several known genes and several diseases or phenotypes for which genes have not been identified. The gene corresponding to this extended cDNA (or genomic DNA obtainable therefrom) thus becomes an immediate candidate for each of these genetic diseases.

Cells from patients with these diseases or phenotypes are isolated andexpanded in culture. PCR primers from the extended cDNA (or genomic DNAobtainable therefrom) are used to screen genomic DNA, mRNA or cDNAobtained from the patients. Extended cDNAs (or genomic DNAs obtainabletherefrom) that are not amplified in the patients can be positivelyassociated with a particular disease by further analysis. Alternatively,the PCR analysis may yield fragments of different lengths when thesamples are derived from an individual having the phenotype associatedwith the disease than when the sample is derived from a healthyindividual, indicating that the gene containing the extended cDNA may beresponsible for the genetic disease.

VI. Use of Extended cDNAs (or Genomic DNAs Obtainable Therefrom) to Construct Vectors

The present extended cDNAs (or genomic DNAs obtainable therefrom) mayalso be used to construct secretion vectors capable of directing thesecretion of the proteins encoded by genes inserted in the vectors. Suchsecretion vectors may facilitate the purification or enrichment of theproteins encoded by genes inserted therein by reducing the number ofbackground proteins from which the desired protein must be purified orenriched. Exemplary secretion vectors are described in Example 54 below.

EXAMPLE 54 Construction of Secretion Vectors

The secretion vectors of the present invention include a promotercapable of directing gene expression in the host cell, tissue, ororganism of interest. Such promoters include the Rous Sarcoma Viruspromoter, the SV40 promoter, the human cytomegalovirus promoter, andother promoters familiar to those skilled in the art.

A signal sequence from an extended cDNA (or genomic DNA obtainabletherefrom), such as one of the signal sequences in SEQ ID NOs: 40-59,61-73, 75-82, 84, and 130-154 as defined in Table IV above, is operablylinked to the promoter such that the mRNA transcribed from the promoterwill direct the translation of the signal peptide. The host cell,tissue, or organism may be any cell, tissue, or organism whichrecognizes the signal peptide encoded by the signal sequence in theextended cDNA (or genomic DNA obtainable therefrom). Suitable hostsinclude mammalian cells, tissues or organisms, avian cells, tissues, ororganisms, insect cells, tissues or organisms, or yeast.

In addition, the secretion vector contains cloning sites for inserting genes encoding the proteins which are to be secreted. The cloning sites facilitate the cloning of the insert gene in frame with the signal sequence such that a fusion protein in which the signal peptide is fused to the protein encoded by the inserted gene is expressed from the mRNA transcribed from the promoter. The signal peptide directs the extracellular secretion of the fusion protein.
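The requirement that the inserted gene be cloned in frame with the signal sequence can be checked with a short sketch. The function name and the sequences used with it are hypothetical; the sketch treats both inputs as coding sequences read from their first base, ignores any linker bases contributed by the cloning sites, and expects the stop codon to be supplied by the insert.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def fuse_in_frame(signal_cds, insert_cds):
    """Join a signal sequence to an insert and verify the reading frame.

    Raises ValueError if the fusion would shift the frame or contain a
    stop codon before its final codon.
    """
    signal = signal_cds.upper()
    insert = insert_cds.upper()
    if not signal.startswith("ATG"):
        raise ValueError("signal sequence should begin with a start codon")
    if len(signal) % 3 != 0:
        raise ValueError("signal sequence length shifts the reading frame")
    if len(insert) % 3 != 0:
        raise ValueError("insert length is not a whole number of codons")
    fused = signal + insert
    codons = [fused[i:i + 3] for i in range(0, len(fused), 3)]
    if any(codon in STOP_CODONS for codon in codons[:-1]):
        raise ValueError("premature stop codon in the fusion")
    return fused
```

A fusion that passes these checks encodes the signal peptide followed directly by the protein of interest in a single open reading frame.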

The secretion vector may be DNA or RNA and may integrate into thechromosome of the host, be stably maintained as an extrachromosomalreplicon in the host, be an artificial chromosome, or be transientlypresent in the host. Many nucleic acid backbones suitable for use assecretion vectors are known to those skilled in the art, includingretroviral vectors, SV40 vectors, Bovine Papilloma Virus vectors, yeastintegrating plasmids, yeast episomal plasmids, yeast artificialchromosomes, human artificial chromosomes, P element vectors,baculovirus vectors, or bacterial plasmids capable of being transientlyintroduced into the host.

The secretion vector may also contain a polyA signal such that the polyA signal is located downstream of the gene inserted into the secretion vector.

After the gene encoding the protein for which secretion is desired is inserted into the secretion vector, the secretion vector is introduced into the host cell, tissue, or organism using calcium phosphate precipitation, DEAE-Dextran, electroporation, liposome-mediated transfection, viral particles or as naked DNA. The protein encoded by the inserted gene is then purified or enriched from the supernatant using conventional techniques such as ammonium sulfate precipitation, immunoprecipitation, immunochromatography, size exclusion chromatography, ion exchange chromatography, and HPLC. Alternatively, the secreted protein may be in a sufficiently enriched or pure state in the supernatant or growth media of the host to permit it to be used for its intended purpose without further enrichment.

The signal sequences may also be inserted into vectors designed for genetherapy. In such vectors, the signal sequence is operably linked to apromoter such that mRNA transcribed from the promoter encodes the signalpeptide. A cloning site is located downstream of the signal sequencesuch that a gene encoding a protein whose secretion is desired mayreadily be inserted into the vector and fused to the signal sequence.The vector is introduced into an appropriate host cell. The proteinexpressed from the promoter is secreted extracellularly, therebyproducing a therapeutic effect.

The extended cDNAs or 5′ ESTs may also be used to clone sequenceslocated upstream of the extended cDNAs or 5′ ESTs which are capable ofregulating gene expression, including promoter sequences, enhancersequences, and other upstream sequences which influence transcription ortranslation levels. Once identified and cloned, these upstreamregulatory sequences may be used in expression vectors designed todirect the expression of an inserted gene in a desired spatial,temporal, developmental, or quantitative fashion. Example 55 describes amethod for cloning sequences upstream of the extended cDNAs or 5′ ESTs.

EXAMPLE 55 Use of Extended cDNAs or 5′ ESTs to Clone Upstream Sequences from Genomic DNA

Sequences derived from extended cDNAs or 5′ ESTs may be used to isolatethe promoters of the corresponding genes using chromosome walkingtechniques. In one chromosome walking technique, which utilizes theGenomeWalker™ kit available from Clontech, five complete genomic DNAsamples are each digested with a different restriction enzyme which hasa 6 base recognition site and leaves a blunt end. Following digestion,oligonucleotide adapters are ligated to each end of the resultinggenomic DNA fragments.

For each of the five genomic DNA libraries, a first PCR reaction is performed according to the manufacturer's instructions (which are incorporated herein by reference) using an outer adaptor primer provided in the kit and an outer gene specific primer. The gene specific primer should be selected to be specific for the extended cDNA or 5′ EST of interest and should have a melting temperature, length, and location in the extended cDNA or 5′ EST which is consistent with its use in PCR reactions. Each first PCR reaction contains 5 ng of genomic DNA, 5 μl of 10× Tth reaction buffer, 0.2 mM of each dNTP, 0.2 μM each of outer adaptor primer and outer gene specific primer, 1.1 mM of Mg(OAc)₂, and 1 μl of the Tth polymerase 50× mix in a total volume of 50 μl. The reaction cycle for the first PCR reaction is as follows: 1 min@94° C./2 sec@94° C., 3 min@72° C. (7 cycles)/2 sec@94° C., 3 min@67° C. (32 cycles)/5 min@67° C.

The product of the first PCR reaction is diluted and used as a template for a second PCR reaction according to the manufacturer's instructions using a pair of nested primers which are located internally on the amplicon resulting from the first PCR reaction. For example, 5 μl of the reaction product of the first PCR reaction mixture may be diluted 180 times. Reactions are made in a 50 μl volume having a composition identical to that of the first PCR reaction except the nested primers are used. The first nested primer is specific for the adaptor, and is provided with the GenomeWalker™ kit. The second nested primer is specific for the particular extended cDNA or 5′ EST for which the promoter is to be cloned and should have a melting temperature, length, and location in the extended cDNA or 5′ EST which is consistent with its use in PCR reactions. The reaction parameters of the second PCR reaction are as follows: 1 min@94° C./2 sec@94° C., 3 min@72° C. (6 cycles)/2 sec@94° C., 3 min@67° C. (25 cycles)/5 min@67° C.
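The two cycling programs can be written out as data, which makes the nested structure easier to compare. The representation below is a hypothetical sketch: each stage is a repeat count with a list of (temperature in ° C., hold time in seconds) steps, the denaturation step of the first program is read as 2 sec at 94° C., and ramp times between temperatures are ignored.

```python
# Each stage: (number of cycles, [(temperature_C, hold_seconds), ...])
FIRST_PCR = [
    (1, [(94, 60)]),              # initial denaturation
    (7, [(94, 2), (72, 180)]),    # 7 cycles annealing/extending at 72
    (32, [(94, 2), (67, 180)]),   # 32 cycles annealing/extending at 67
    (1, [(67, 300)]),             # final extension
]

SECOND_PCR = [
    (1, [(94, 60)]),
    (6, [(94, 2), (72, 180)]),
    (25, [(94, 2), (67, 180)]),
    (1, [(67, 300)]),
]


def hold_time_seconds(program):
    """Total programmed hold time, excluding temperature ramps."""
    return sum(
        cycles * sum(seconds for _, seconds in steps)
        for cycles, steps in program
    )


def total_cycles(program):
    """Number of amplification cycles (stages that denature and extend)."""
    return sum(cycles for cycles, steps in program if len(steps) > 1)
```

Under these assumptions the first program runs 39 amplification cycles and the nested program 31, with programmed hold times of 7458 and 6002 seconds respectively.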

The product of the second PCR reaction is purified, cloned, and sequenced using standard techniques. Alternatively, two or more human genomic DNA libraries can be constructed by using two or more restriction enzymes. The digested genomic DNA is cloned into vectors which can be converted into single stranded, circular, or linear DNA. A biotinylated oligonucleotide comprising at least 15 nucleotides from the extended cDNA or 5′ EST sequence is hybridized to the single stranded DNA. Hybrids between the biotinylated oligonucleotide and the single stranded DNA containing the extended cDNA or EST sequence are isolated as described in Example 29 above. Thereafter, the single stranded DNA containing the extended cDNA or EST sequence is released from the beads and converted into double stranded DNA using a primer specific for the extended cDNA or 5′ EST sequence or a primer corresponding to a sequence included in the cloning vector. The resulting double stranded DNA is transformed into bacteria. DNAs containing the 5′ EST or extended cDNA sequences are identified by colony PCR or colony hybridization.

Once the upstream genomic sequences have been cloned and sequenced as described above, prospective promoters and transcription start sites within the upstream sequences may be identified by comparing the sequences upstream of the extended cDNAs or 5′ ESTs with databases containing known transcription start sites, transcription factor binding sites, or promoter sequences.

In addition, promoters in the upstream sequences may be identified using promoter reporter vectors as described in Example 56.

EXAMPLE 56 Identification of Promoters in Cloned Upstream Sequences

The genomic sequences upstream of the extended cDNAs or 5′ ESTs are cloned into a suitable promoter reporter vector, such as the pSEAP-Basic, pSEAP-Enhancer, pβgal-Basic, pβgal-Enhancer, or pEGFP-1 Promoter Reporter vectors available from Clontech. Briefly, each of these promoter reporter vectors includes multiple cloning sites positioned upstream of a reporter gene encoding a readily assayable protein such as secreted alkaline phosphatase, β galactosidase, or green fluorescent protein. The sequences upstream of the extended cDNAs or 5′ ESTs are inserted into the cloning sites upstream of the reporter gene in both orientations and introduced into an appropriate host cell. The level of reporter protein is assayed and compared to the level obtained from a vector which lacks an insert in the cloning site. The presence of an elevated expression level in the vector containing the insert with respect to the control vector indicates the presence of a promoter in the insert. If necessary, the upstream sequences can be cloned into vectors which contain an enhancer for augmenting transcription levels from weak promoter sequences. A significant level of expression above that observed with the vector lacking an insert indicates that a promoter sequence is present in the inserted upstream sequence.
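The comparison between the insert-containing vector and the insert-free control can be expressed as a fold ratio. The sketch below is illustrative only. The function name and the default two-fold cutoff are assumptions made for the example, since the text does not define what constitutes a significant level of expression.

```python
def promoter_activity(insert_level: float, control_level: float,
                      min_fold: float = 2.0) -> tuple:
    """Return (fold over control, whether the fold meets the cutoff).

    Assumption: min_fold is a hypothetical threshold; the source text only
    requires expression to be elevated relative to the empty vector.
    """
    if control_level <= 0:
        raise ValueError("control reporter level must be positive")
    fold = insert_level / control_level
    return fold, fold >= min_fold


# Hypothetical reporter readings in arbitrary units.
print(promoter_activity(50.0, 10.0))   # (5.0, True)
print(promoter_activity(12.0, 10.0))   # (1.2, False)
```

In practice replicate measurements and a statistical test would be used in place of a single fixed cutoff.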

Appropriate host cells for the promoter reporter vectors may be chosen based on the results of the above described determination of expression patterns of the extended cDNAs and ESTs. For example, if the expression pattern analysis indicates that the mRNA corresponding to a particular extended cDNA or 5′ EST is expressed in fibroblasts, the promoter reporter vector may be introduced into a human fibroblast cell line.

Promoter sequences within the upstream genomic DNA may be further defined by constructing nested deletions in the upstream DNA using conventional techniques such as Exonuclease III digestion. The resulting deletion fragments can be inserted into the promoter reporter vector to determine whether the deletion has reduced or obliterated promoter activity. In this way, the boundaries of the promoters may be defined. If desired, potential individual regulatory sites within the promoter may be identified using site directed mutagenesis or linker scanning to obliterate potential transcription factor binding sites within the promoter individually or in combination. The effects of these mutations on transcription levels may be determined by inserting the mutations into the cloning sites in the promoter reporter vectors.

EXAMPLE 57 Cloning and Identification of Promoters

Using the method described in Example 55 above with 5′ ESTs, sequences upstream of several genes were obtained. Using the primer pair GGG AAG ATG GAG ATA GTA TTG CCT G (SEQ ID NO:29) and CTG CCA TGT ACA TGA TAG AGA GAT TC (SEQ ID NO:30), the promoter having the internal designation P13H2 (SEQ ID NO:31) was obtained.

Using the primer pair GTA CCA GGGG ACT GTG ACC ATT GC (SEQ ID NO:32) and CTG TGA CCA TTG CTC CCA AGA GAG (SEQ ID NO:33), the promoter having the internal designation P15B4 (SEQ ID NO:34) was obtained.

Using the primer pair CTG GGA TGG AAG GCA CGG TA (SEQ ID NO:35) and GAG ACC ACA CAG CTA GAC AA (SEQ ID NO:36), the promoter having the internal designation P29B6 (SEQ ID NO:37) was obtained.

FIG. 8 provides a schematic description of the promoters isolated and the way they are assembled with the corresponding 5′ tags. The upstream sequences were screened for the presence of motifs resembling transcription factor binding sites or known transcription start sites using the computer program MatInspector release 2.0, August 1996.

FIG. 9 describes the transcription factor binding sites present in each of these promoters. The column labeled “matrice” provides the name of the MatInspector matrix used. The column labeled “position” provides the 5′ position of the promoter site. Numeration of the sequence starts from the transcription start site as determined by matching the genomic sequence with the 5′ EST sequence. The column labeled “orientation” indicates the DNA strand on which the site is found, with the + strand being the coding strand as determined by matching the genomic sequence with the sequence of the 5′ EST. The column labeled “score” provides the MatInspector score found for this site. The column labeled “length” provides the length of the site in nucleotides. The column labeled “sequence” provides the sequence of the site found.

The promoters and other regulatory sequences located upstream of the extended cDNAs or 5′ ESTs may be used to design expression vectors capable of directing the expression of an inserted gene in a desired spatial, temporal, developmental, or quantitative manner. A promoter capable of directing the desired spatial, temporal, developmental, and quantitative patterns may be selected using the results of the expression analysis described in Example 26 above. For example, if a promoter which confers a high level of expression in muscle is desired, the promoter sequence upstream of an extended cDNA or 5′ EST derived from an mRNA which is expressed at a high level in muscle, as determined by the method of Example 26, may be used in the expression vector.

Preferably, the desired promoter is placed near multiple restriction sites to facilitate the cloning of the desired insert downstream of the promoter, such that the promoter is able to drive expression of the inserted gene. The promoter may be inserted in conventional nucleic acid backbones designed for extrachromosomal replication, integration into the host chromosomes or transient expression. Suitable backbones for the present expression vectors include retroviral backbones, backbones from eukaryotic episomes such as SV40 or Bovine Papilloma Virus, backbones from bacterial episomes, or artificial chromosomes.

Preferably, the expression vectors also include a polyA signal downstream of the multiple restriction sites for directing the polyadenylation of mRNA transcribed from the gene inserted into the expression vector.

Following the identification of promoter sequences using the procedures of Examples 55-57, proteins which interact with the promoter may be identified as described in Example 58 below.

EXAMPLE 58 Identification of Proteins Which Interact with Promoter Sequences, Upstream Regulatory Sequences, or mRNA

Sequences within the promoter region which are likely to bind transcription factors may be identified by homology to known transcription factor binding sites or through conventional mutagenesis or deletion analyses of reporter plasmids containing the promoter sequence. For example, deletions may be made in a reporter plasmid containing the promoter sequence of interest operably linked to an assayable reporter gene. The reporter plasmids carrying various deletions within the promoter region are transfected into an appropriate host cell and the effects of the deletions on expression levels are assessed. Transcription factor binding sites within the regions in which deletions reduce expression levels may be further localized using site directed mutagenesis, linker scanning analysis, or other techniques familiar to those skilled in the art. Nucleic acids encoding proteins which interact with sequences in the promoter may be identified using one-hybrid systems such as those described in the manual accompanying the Matchmaker One-Hybrid System kit available from Clontech (Catalog No. K1603-1), the disclosure of which is incorporated herein by reference. Briefly, the Matchmaker One-hybrid system is used as follows. The target sequence for which it is desired to identify binding proteins is cloned upstream of a selectable reporter gene and integrated into the yeast genome. Preferably, multiple copies of the target sequences are inserted into the reporter plasmid in tandem.

A library comprised of fusions between cDNAs to be evaluated for the ability to bind to the promoter and the activation domain of a yeast transcription factor, such as GAL4, is transformed into the yeast strain containing the integrated reporter sequence. The yeast are plated on selective media to select cells expressing the selectable marker linked to the promoter sequence. The colonies which grow on the selective media contain genes encoding proteins which bind the target sequence. The inserts in the genes encoding the fusion proteins are further characterized by sequencing. In addition, the inserts may be inserted into expression vectors or in vitro transcription vectors. Binding of the polypeptides encoded by the inserts to the promoter DNA may be confirmed by techniques familiar to those skilled in the art, such as gel shift analysis or DNAse protection analysis.

VII. Use of Extended cDNAs (or Genomic DNAs Obtainable Therefrom) in Gene Therapy

The present invention also comprises the use of extended cDNAs (or genomic DNAs obtainable therefrom) in gene therapy strategies, including antisense and triple helix strategies as described in Examples 59 and 60 below. In antisense approaches, nucleic acid sequences complementary to an mRNA are hybridized to the mRNA intracellularly, thereby blocking the expression of the protein encoded by the mRNA. The antisense sequences may prevent gene expression through a variety of mechanisms. For example, the antisense sequences may inhibit the ability of ribosomes to translate the mRNA. Alternatively, the antisense sequences may block transport of the mRNA from the nucleus to the cytoplasm, thereby limiting the amount of mRNA available for translation. Another mechanism through which antisense sequences may inhibit gene expression is by interfering with mRNA splicing. In yet another strategy, the antisense nucleic acid may be incorporated in a ribozyme capable of specifically cleaving the target mRNA.

EXAMPLE 59 Preparation and Use of Antisense Oligonucleotides

The antisense nucleic acid molecules to be used in gene therapy may be either DNA or RNA sequences. They may comprise a sequence complementary to the sequence of the extended cDNA (or genomic DNA obtainable therefrom). The antisense nucleic acids should have a length and melting temperature sufficient to permit formation of an intracellular duplex having sufficient stability to inhibit the expression of the mRNA in the duplex. Strategies for designing antisense nucleic acids suitable for use in gene therapy are disclosed in Green et al., Ann. Rev. Biochem. 55:569-597 (1986) and Izant and Weintraub, Cell 36:1007-1015 (1984), which are hereby incorporated by reference.

In some strategies, antisense molecules are obtained from a nucleotide sequence encoding a protein by reversing the orientation of the coding region with respect to a promoter so as to transcribe the opposite strand from that which is normally transcribed in the cell. The antisense molecules may be transcribed using in vitro transcription systems such as those which employ T7 or SP6 polymerase to generate the transcript. Another approach involves transcription of the antisense nucleic acids in vivo by operably linking DNA containing the antisense sequence to a promoter in an expression vector.

Alternatively, oligonucleotides which are complementary to the strand normally transcribed in the cell may be synthesized in vitro. Thus, the antisense nucleic acids are complementary to the corresponding mRNA and are capable of hybridizing to the mRNA to create a duplex. In some embodiments, the antisense sequences may contain modified sugar phosphate backbones to increase stability and make them less sensitive to RNase activity. Examples of modifications suitable for use in antisense strategies are described by Rossi et al., Pharmacol. Ther. 50(2):245-254, (1991).
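Because an antisense oligonucleotide must be complementary to the mRNA, its sequence is the reverse complement of the targeted stretch of the mRNA (or of the sense strand of the cDNA). A minimal Python sketch is given below. The function name antisense_dna and the short target sequence are hypothetical, and the output is written as DNA.

```python
# Complement table; U is accepted so that an mRNA target can be supplied.
_COMPLEMENT = str.maketrans("ACGTU", "TGCAA")


def antisense_dna(target: str) -> str:
    """Return the DNA antisense (reverse complement) of a sense target,
    read 5' to 3'. The target may be given as cDNA (T) or mRNA (U)."""
    seq = target.replace(" ", "").upper()
    if not seq or set(seq) - set("ACGTU"):
        raise ValueError("target must contain only A, C, G, T or U")
    return seq.translate(_COMPLEMENT)[::-1]


print(antisense_dna("ATGGCC"))  # GGCCAT
print(antisense_dna("AUGGCC"))  # GGCCAT
```

This sketch covers only sequence complementarity; the length, melting temperature, and backbone chemistry of an actual oligonucleotide would be chosen as discussed in the text.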

Various types of antisense oligonucleotides complementary to the sequence of the extended cDNA (or genomic DNA obtainable therefrom) may be used. In one preferred embodiment, stable and semi-stable antisense oligonucleotides described in International Application No. PCT WO94/23026, hereby incorporated by reference, are used. In these molecules, the 3′ end or both the 3′ and 5′ ends are engaged in intramolecular hydrogen bonding between complementary base pairs. These molecules are better able to withstand exonuclease attacks and exhibit increased stability compared to conventional antisense oligonucleotides.

In another preferred embodiment, the antisense oligodeoxynucleotides against herpes simplex virus types 1 and 2 described in International Application No. WO 95/04141, hereby incorporated by reference, are used.

In yet another preferred embodiment, the covalently cross-linked antisense oligonucleotides described in International Application No. WO 96/31523, hereby incorporated by reference, are used. These double- or single-stranded oligonucleotides comprise one or more, respectively, inter- or intra-oligonucleotide covalent cross-linkages, wherein the linkage consists of an amide bond between a primary amine group of one strand and a carboxyl group of the other strand or of the same strand, respectively, the primary amine group being directly substituted in the 2′ position of the strand nucleotide monosaccharide ring, and the carboxyl group being carried by an aliphatic spacer group substituted on a nucleotide or nucleotide analog of the other strand or the same strand, respectively.

The antisense oligodeoxynucleotides and oligonucleotides disclosed in International Application No. WO 92/18522, incorporated by reference, may also be used. These molecules are stable to degradation and contain at least one transcription control recognition sequence which binds to control proteins and are effective as decoys therefor. These molecules may contain “hairpin” structures, “dumbbell” structures, “modified dumbbell” structures, “cross-linked” decoy structures and “loop” structures.

In another preferred embodiment, the cyclic double-stranded oligonucleotides described in European Patent Application No. 0 572 287 A2, hereby incorporated by reference, are used. These ligated oligonucleotide “dumbbells” contain the binding site for a transcription factor and inhibit expression of the gene under control of the transcription factor by sequestering the factor.

Use of the closed antisense oligonucleotides disclosed in International Application No. WO 92/19732, hereby incorporated by reference, is also contemplated. Because these molecules have no free ends, they are more resistant to degradation by exonucleases than are conventional oligonucleotides. These oligonucleotides may be multifunctional, interacting with several regions which are not adjacent to the target mRNA.

The appropriate level of antisense nucleic acids required to inhibit gene expression may be determined using in vitro expression analysis. The antisense molecule may be introduced into the cells by diffusion, injection, infection or transfection using procedures known in the art. For example, the antisense nucleic acids can be introduced into the body as a bare or naked oligonucleotide, oligonucleotide encapsulated in lipid, oligonucleotide sequence encapsulated by viral protein, or as an oligonucleotide operably linked to a promoter contained in an expression vector. The expression vector may be any of a variety of expression vectors known in the art, including retroviral or viral vectors, vectors capable of extrachromosomal replication, or integrating vectors. The vectors may be DNA or RNA.

The antisense molecules are introduced onto cell samples at a number of different concentrations, preferably between 1×10⁻¹⁰ M and 1×10⁻⁴ M. Once the minimum concentration that can adequately control gene expression is identified, the optimized dose is translated into a dosage suitable for use in vivo. For example, an inhibiting concentration in culture of 1×10⁻⁷ M translates into a dose of approximately 0.6 mg/kg bodyweight. Levels of oligonucleotide approaching 100 mg/kg bodyweight or higher may be possible after testing the toxicity of the oligonucleotide in laboratory animals. It is additionally contemplated that cells from the vertebrate are removed, treated with the antisense oligonucleotide, and reintroduced into the vertebrate.
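The text does not state how the culture concentration is converted into a dose. The Python sketch below shows one set of assumptions that reproduces the stated figure: an oligonucleotide molecular weight of about 6000 g/mol (roughly a 20-mer) and a distribution volume of 1 liter per kg bodyweight. Both values, and the function names, are assumptions introduced for illustration and are not taken from the source.

```python
def concentration_series(low_exp: int = -10, high_exp: int = -4) -> list:
    """Ten-fold dilution series (molar) spanning 1e-10 M to 1e-4 M."""
    return [10.0 ** e for e in range(low_exp, high_exp + 1)]


def culture_to_dose_mg_per_kg(molar: float,
                              mol_weight_g_per_mol: float = 6000.0,
                              dist_volume_l_per_kg: float = 1.0) -> float:
    """Convert a molar culture concentration to mg per kg bodyweight.

    Assumptions (not from the source): ~6000 g/mol oligonucleotide and
    1 L/kg distribution volume.
    """
    mg_per_liter = molar * mol_weight_g_per_mol * 1000.0  # g/L -> mg/L
    return mg_per_liter * dist_volume_l_per_kg


print(len(concentration_series()))        # 7 test concentrations
print(culture_to_dose_mg_per_kg(1e-7))    # approximately 0.6
```

An actual dosage would depend on the pharmacokinetics of the specific oligonucleotide and on toxicity testing, as the text notes.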

It is further contemplated that the antisense oligonucleotide sequence is incorporated into a ribozyme sequence to enable the antisense to specifically bind and cleave its target mRNA. For technical applications of ribozyme and antisense oligonucleotides see Rossi et al., supra.

In a preferred application of this invention, the polypeptide encoded by the gene is first identified, so that the effectiveness of antisense inhibition on translation can be monitored using techniques that include but are not limited to antibody-mediated tests such as RIAs and ELISA, functional assays, or radiolabeling.

The extended cDNAs of the present invention (or genomic DNAs obtainable therefrom) may also be used in gene therapy approaches based on intracellular triple helix formation. Triple helix oligonucleotides are used to inhibit transcription from a genome. They are particularly useful for studying alterations in cell activity as it is associated with a particular gene. The extended cDNAs (or genomic DNAs obtainable therefrom) of the present invention or, more preferably, a portion of those sequences, can be used to inhibit gene expression in individuals having diseases associated with expression of a particular gene. Similarly, a portion of the extended cDNA (or genomic DNA obtainable therefrom) can be used to study the effect of inhibiting transcription of a particular gene within a cell. Traditionally, homopurine sequences were considered the most useful for triple helix strategies. However, homopyrimidine sequences can also inhibit gene expression. Such homopyrimidine oligonucleotides bind to the major groove at homopurine:homopyrimidine sequences. Thus, both types of sequences from the extended cDNA or from the gene corresponding to the extended cDNA are contemplated within the scope of this invention.

EXAMPLE 60 Preparation and Use of Triple Helix Probes

The sequences of the extended cDNAs (or genomic DNAs obtainable therefrom) are scanned to identify 10-mer to 20-mer homopyrimidine or homopurine stretches which could be used in triple-helix based strategies for inhibiting gene expression. Following identification of candidate homopyrimidine or homopurine stretches, their efficiency in inhibiting gene expression is assessed by introducing varying amounts of oligonucleotides containing the candidate sequences into tissue culture cells which normally express the target gene. The oligonucleotides may be prepared on an oligonucleotide synthesizer or they may be purchased commercially from a company specializing in custom oligonucleotide synthesis, such as GENSET, Paris, France.
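The scanning step can be carried out with a simple pattern search. The Python sketch below is one possible way to do it and is not the procedure actually used. It reports every maximal run of at least 10 purines (A, G) or pyrimidines (C, T), from which 10-mer to 20-mer candidates can then be taken as windows. The function names and the demonstration sequence are hypothetical.

```python
import re


def find_homo_runs(sequence: str, min_len: int = 10) -> list:
    """Return (kind, start, end, run) for each maximal homopurine or
    homopyrimidine run of at least min_len bases.

    Positions are 0-based and end is exclusive.
    """
    seq = sequence.replace(" ", "").upper()
    patterns = {
        "homopurine": "[AG]{%d,}" % min_len,
        "homopyrimidine": "[CT]{%d,}" % min_len,
    }
    hits = []
    for kind, pattern in patterns.items():
        for match in re.finditer(pattern, seq):
            hits.append((kind, match.start(), match.end(), match.group()))
    return sorted(hits, key=lambda hit: hit[1])


def windows(run: str, size: int) -> list:
    """All candidate oligonucleotide sequences of a given size in a run."""
    return [run[i:i + size] for i in range(len(run) - size + 1)]


# Made-up demonstration sequence containing one run of each kind.
demo = "ACGTAC" + "GAAGGAGAAGGA" + "CATG" + "CTTCCTCTTCCT" + "GACG"
hits = find_homo_runs(demo)
for hit in hits:
    print(hit)
```

Runs longer than 20 bases are reported whole; windows(run, 20) can be used to enumerate the 20-mers they contain.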

The oligonucleotides may be introduced into the cells using a variety of methods known to those skilled in the art, including but not limited to calcium phosphate precipitation, DEAE-Dextran, electroporation, liposome-mediated transfection or native uptake.

Treated cells are monitored for altered cell function or reduced gene expression using techniques such as Northern blotting, RNase protection assays, or PCR based strategies to monitor the transcription levels of the target gene in cells which have been treated with the oligonucleotide. The cell functions to be monitored are predicted based upon the homologies of the target gene corresponding to the extended cDNA from which the oligonucleotide was derived with known gene sequences that have been associated with a particular function. The cell functions can also be predicted based on the presence of abnormal physiologies within cells derived from individuals with a particular inherited disease, particularly when the extended cDNA is associated with the disease using techniques described in Example 53.

The oligonucleotides which are effective in inhibiting gene expression in tissue culture cells may then be introduced in vivo using the techniques described above and in Example 59 at a dosage calculated based on the in vitro results, as described in Example 59.

In some embodiments, the natural (beta) anomers of the oligonucleotide units can be replaced with alpha anomers to render the oligonucleotide more resistant to nucleases. Further, an intercalating agent such as ethidium bromide, or the like, can be attached to the 3′ end of the alpha oligonucleotide to stabilize the triple helix. For information on the generation of oligonucleotides suitable for triple helix formation see Griffin et al. (Science 245:967-971 (1989), which is hereby incorporated by this reference).

EXAMPLE 61 Use of Extended cDNAs to Express an Encoded Protein in a Host Organism

The extended cDNAs of the present invention may also be used to express an encoded protein in a host organism to produce a beneficial effect. In such procedures, the encoded protein may be transiently expressed in the host organism or stably expressed in the host organism. The encoded protein may have any of the activities described above. The encoded protein may be a protein which the host organism lacks or, alternatively, the encoded protein may augment the existing levels of the protein in the host organism.

A full length extended cDNA encoding the signal peptide and the mature protein, or an extended cDNA encoding only the mature protein is introduced into the host organism. The extended cDNA may be introduced into the host organism using a variety of techniques known to those of skill in the art. For example, the extended cDNA may be injected into the host organism as naked DNA such that the encoded protein is expressed in the host organism, thereby producing a beneficial effect.

Alternatively, the extended cDNA may be cloned into an expression vector downstream of a promoter which is active in the host organism. The expression vector may be any of the expression vectors designed for use in gene therapy, including viral or retroviral vectors.

The expression vector may be directly introduced into the host organism such that the encoded protein is expressed in the host organism to produce a beneficial effect. In another approach, the expression vector may be introduced into cells in vitro. Cells containing the expression vector are thereafter selected and introduced into the host organism, where they express the encoded protein to produce a beneficial effect.

EXAMPLE 62 Use of Signal Peptides Encoded by 5′ ESTs or Sequences Obtained Therefrom to Import Proteins Into Cells

The short core hydrophobic region (h) of signal peptides encoded by the 5′ ESTs or extended cDNAs derived from the 5′ ESTs of the present invention may also be used as a carrier to import a peptide or a protein of interest, so-called cargo, into tissue culture cells (Lin et al., J. Biol. Chem., 270: 14225-14258 (1995); Du et al., J. Peptide Res., 51: 235-243 (1998); Rojas et al., Nature Biotech., 16: 370-375 (1998)).

When cell permeable peptides of limited size (approximately up to 25 amino acids) are to be translocated across the cell membrane, chemical synthesis may be used in order to add the h region to either the C-terminus or the N-terminus of the cargo peptide of interest. Alternatively, when longer peptides or proteins are to be imported into cells, nucleic acids can be genetically engineered, using techniques familiar to those skilled in the art, in order to link the extended cDNA sequence encoding the h region to the 5′ or the 3′ end of a DNA sequence coding for a cargo polypeptide. Such genetically engineered nucleic acids are then translated either in vitro or in vivo after transfection into appropriate cells, using conventional techniques to produce the resulting cell permeable polypeptide. Suitable host cells are then simply incubated with the cell permeable polypeptide which is then translocated across the membrane.

This method may be applied to study diverse intracellular functions and cellular processes. For instance, it has been used to probe functionally relevant domains of intracellular proteins and to examine protein-protein interactions involved in signal transduction pathways (Lin et al., supra; Lin et al., J. Biol. Chem., 271: 5305-5308 (1996); Rojas et al., J. Biol. Chem., 271: 27456-27461 (1996); Liu et al., Proc. Natl. Acad. Sci. USA, 93: 11819-11824 (1996); Rojas et al., Bioch. Biophys. Res. Commun., 234: 675-680 (1997)).

Such techniques may be used in cellular therapy to import proteins producing therapeutic effects. For instance, cells isolated from a patient may be treated with imported therapeutic proteins and then re-introduced into the host organism.

Alternatively, the h region of signal peptides of the present invention could be used in combination with a nuclear localization signal to deliver nucleic acids into the cell nucleus. Such oligonucleotides may be antisense oligonucleotides or oligonucleotides designed to form triple helixes, as described in Examples 59 and 60 respectively, in order to inhibit processing and maturation of a target cellular RNA.

EXAMPLE 63 Reassembling & Resequencing of Clones

Full length cDNA clones obtained by the procedure described in Example 27 were double-sequenced. These sequences were assembled and the resulting consensus sequences were then reanalyzed. Open reading frames were reassigned following essentially the same process as the one described in Example 27.

After this reanalysis process a few abnormalities were revealed. The sequence presented in SEQ ID NO: 84 is apparently unlikely to be a genuine full length cDNA. This clone is more probably a 3′ truncated cDNA sequence based on homology studies with existing protein sequences. Similarly, the sequences presented in SEQ ID NOs: 60, 76, 83 and 84 may also not be genuine full length cDNAs based on homology studies with existing protein sequences. Although these sequences encode a potential start methionine, except for SEQ ID NO:60, they could represent 5′ truncated cDNAs.

Finally, after the reassignment of open reading frames for the clones, new open reading frames were chosen in some instances. For example, in the case of SEQ ID NOs: 60, 74 and 83 the new open reading frames were no longer predicted to contain a signal peptide.

As discussed above, Table IV provides the sequence identification numbers of the extended cDNAs of the present invention, the locations of the full coding sequences in SEQ ID NOs: 40-84 and 130-154 (i.e. the nucleotides encoding both the signal peptide and the mature protein, listed under the heading FCS location in Table IV), the locations of the nucleotides in SEQ ID NOs: 40-84 and 130-154 which encode the signal peptides (listed under the heading SigPep Location in Table IV), the locations of the nucleotides in SEQ ID NOs: 40-84 and 130-154 which encode the mature proteins generated by cleavage of the signal peptides (listed under the heading Mature Polypeptide Location in Table IV), the locations in SEQ ID NOs: 40-84 and 130-154 of stop codons (listed under the heading Stop Codon Location in Table IV), the locations in SEQ ID NOs: 40-84 and 130-154 of polyA signals (listed under the heading PolyA Signal Location in Table IV) and the locations of polyA sites (listed under the heading PolyA Site Location in Table IV).

As discussed above, Table V lists the sequence identification numbers of the polypeptides of SEQ ID NOs: 85-129 and 155-179, the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the full length polypeptide (second column), the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the signal peptides (third column), and the locations of the amino acid residues of SEQ ID NOs: 85-129 and 155-179 in the mature polypeptide created by cleaving the signal peptide from the full length polypeptide (fourth column). In Table V, and in the appended sequence listing, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1 and the first amino acid of the signal peptide is designated with the appropriate negative number, in accordance with the regulations governing sequence listings.

EXAMPLE 64 Functional Analysis of Predicted Protein Sequences

Following double-sequencing, new contigs were assembled for each of the extended cDNAs of the present invention and each was compared to known sequences available at the time of filing. These sequences originate from the following databases: Genbank (release 108 and daily releases up to Oct. 15, 1998), Genseq (release 32), PIR (release 53) and Swissprot (release 35). The predicted proteins of the present invention matching known proteins were further classified into 3 categories depending on the level of homology.

The first category contains proteins of the present invention exhibiting more than 80% identical amino acid residues on the whole length of the matched protein. They are clearly close homologues which most probably have the same function or a very similar function as the matched protein.

The second category contains proteins of the present invention exhibiting more remote homologies (30 to 80% over the whole protein) indicating that the protein of the present invention is likely to have a function similar to that of the matched protein.

The third category contains proteins exhibiting either high homology (90 to 100%) to a short domain or more remote homology (40 to 60%) to a larger domain of a known protein indicating that the matched protein and the protein of the invention may share similar features.
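The three categories can be summarized as a small decision rule. The Python sketch below is only an illustration of the thresholds stated above. The function names are hypothetical, and a single whole_length flag stands in for the distinction between a match over the whole protein and a match to a domain, so the difference between a short and a larger domain is not modeled. The example values (a 92 residue match with two mismatches) are illustrative.

```python
def percent_identity(identical: int, aligned_length: int) -> float:
    """Percentage of identical residues over the aligned length."""
    if aligned_length <= 0:
        raise ValueError("aligned_length must be positive")
    return 100.0 * identical / aligned_length


def homology_category(identity_pct: float, whole_length: bool):
    """Return 1, 2 or 3 following the stated thresholds, or None.

    Assumption: whole_length=False means the match covers only a domain.
    """
    if whole_length:
        if identity_pct > 80:
            return 1
        if identity_pct >= 30:
            return 2
        return None
    if 90 <= identity_pct <= 100 or 40 <= identity_pct <= 60:
        return 3
    return None


# Illustrative case: 90 identical residues out of 92 over the whole protein.
print(homology_category(percent_identity(90, 92), whole_length=True))  # 1
```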

It should be noted that, in the numbering of amino acids in the protein sequences discussed in FIGS. 10 to 12 and Table VIII, the first methionine encountered is designated as amino acid number 1. In the appended sequence listing, the first amino acid of the mature protein resulting from cleavage of the signal peptide is designated as amino acid number 1 and the first amino acid of the signal peptide is designated with the appropriate negative number, in accordance with the regulations governing sequence listings.

In addition, all of the corrected amino acid sequences (SEQ ID NOs: 85-129 and 155-179) were scanned for the presence of known protein signatures and motifs. This search was performed against the Prosite 15.0 database, using the Proscan software from the GCG package. Functional signatures and their locations are indicated in Table VIII.
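The two numbering conventions can be converted into one another once the length of the signal peptide is known. The Python sketch below illustrates the conversion from methionine-based numbering to sequence listing numbering. The function name and the 20 residue signal peptide are hypothetical, and the sketch assumes that the sequence listing numbering has no residue numbered zero, so that the last residue of the signal peptide is -1.

```python
def listing_number(met_position: int, signal_length: int) -> int:
    """Convert a position counted from the first methionine (1-based)
    into sequence listing numbering.

    Residues of the signal peptide receive negative numbers and the first
    residue of the mature protein is number 1 (assumption: no residue 0).
    """
    if met_position < 1 or signal_length < 0:
        raise ValueError("position must be >= 1 and signal_length >= 0")
    if met_position > signal_length:
        return met_position - signal_length        # mature protein
    return met_position - signal_length - 1        # signal peptide


# Hypothetical protein with a 20 residue signal peptide.
print(listing_number(1, 20))    # -20 (initiator methionine)
print(listing_number(20, 20))   # -1  (last signal peptide residue)
print(listing_number(21, 20))   # 1   (first mature residue)
```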

A) Proteins which are Closely Related to Known Proteins

Protein of SEQ ID NO:120 (internal designation 26-44-1-B5-CL31)

The protein of SEQ ID NO:120 encoded by the extended cDNA SEQ ID NO: 75isolated from ovary shows extensive homology to a human protein calledphospholemman or PLM and its homologues in rodent and canine species.PLM is encoded by the nucleic acid sequence of Genbank accession numberU72245 and has the amino acid sequence of SEQ ID NO: 180. Phospholemmanis a prominent plasma membrane protein whose phosphorylation correlateswith an increase in contractility of myocardium and skeletal muscle.Initially described as a simple chloride channel, it has recently beenshown to be a channel for taurine that acts as an osmolyte in theregulation of cell volume (Moorman et al, Adv Exp. Med. Biol.,442:219-228 (1998)).

As shown by the alignment in FIG. 10 between the protein of SEQ ID NO:120 and PLM, the amino acid residues are identical except for positions 3 and 5 in the 92 amino acid long matched protein. The substitution of a proline residue at position 3 by another neutral residue, serine, is conservative. In addition, the protein of the invention also exhibits the typical ATP1G/PLM/MAT8 PROSITE signature (positions 27 to 40, in bold in FIG. 10) for a family containing mostly proteins known to be either chloride channels or chloride channel regulators. In addition, the protein of the invention contains 2 short transmembrane segments from positions 1 to 21 and from 37 to 57 as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10:685-686 (1994)). The first segment (in italic) corresponds to the signal peptide of PLM and the second transmembrane domain (underlined) matches the transmembrane region (double-underlined) shown to be the chloride channel itself (Chen et al., Circ. Res., 82:367-374 (1998)).

Taken together, these data suggest that the protein of SEQ ID NO:120 may be involved in the regulation of cell volume and in tissue contractility. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, diarrhea, fertility disorders, and contractility disorders including muscle disorders, pulmonary disorders and myocardial disorders.

Protein of SEQ ID NO:121 (Internal Designation 47-4-4-C6-CL2_3)

The protein of SEQ ID NO:121 encoded by the extended cDNA SEQ ID NO: 76 found in substantia nigra shows extensive homology with the human E25 protein. The E25 protein is encoded by the nucleic acid sequence of Genbank accession number AF038953 and has the amino acid sequence of SEQ ID NO:181. The matched protein might be involved in the development and differentiation of haematopoietic stem/progenitor cells. In addition, it is the human homologue of a murine protein thought to be involved in chondro-osteogenic differentiation and belonging to a novel multigene family of integral membrane proteins (Deleersnijder et al., J. Biol. Chem., 271:19475-19482 (1996)).

As shown by the alignments in FIG. 11 between the protein of SEQ ID NO:121 and E25, the amino acid residues are identical except for positions 9, 24 and 121 in the 263 amino acid long matched sequence. All these substitutions are conservative. In addition, the protein of the invention contains one short transmembrane segment from positions 1 to 21 (underlined in FIG. 11), matching the one predicted for the murine E25 protein, as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10:685-686 (1994)).

Taken together, these data suggest that the protein of SEQ ID NO:121 may be involved in cellular proliferation and differentiation, and/or in haematopoiesis. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, hematological, chondro-osteogenic and embryogenetic disorders.

Protein of SEQ ID NO:128 (Internal Designation 58-34-2-H8-CL1_3)

The protein of SEQ ID NO:128 encoded by the extended cDNA SEQ ID NO: 83 isolated from kidney shows extensive homology to the murine WW-domain binding protein 1 or WWBP-1. WWBP-1 is encoded by the nucleic acid sequence of Genbank accession number U40825 and has the amino acid sequence of SEQ ID NO:182. This protein, which is expressed in placenta, lung, liver and kidney, is thought to play a role in intracellular signaling by binding to the WW domain of the Yes protooncogene-associated protein via its so-called PY domain (Chen and Sudol, Proc. Natl. Acad. Sci., 92:7819-7823 (1995)). The WW-PY domains are thought to represent a new set of modular protein-binding sequences just like the SH3-PXXP domains (Sudol et al., FEBS Lett., 369:67-71 (1995)).

As shown by the alignments of FIG. 12 between the protein of SEQ ID NO:128 and WWBP-1, the amino acid residues are identical to those of the 305 amino acid long matched protein except for positions 53, 66, 78, 89, 92, 94, 96, 100, 102, 106, 110, 113, 124, 128, 136, 139, 140, 142-144, 166, 168, 173, 176, 178, 181, 182, 188, 196, 199, 201, 202, 207 and 210 of the matched protein. 68% of these substitutions are conservative. Indeed, the histidine-rich PY domain is present in the protein of the invention (positions 82-86, in bold in FIG. 12).
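The figures quoted above can be checked with a little arithmetic. The sketch below expands the listed positions, counts them, and derives a percent identity. The helper name is hypothetical, and the identity value assumes that the alignment spans all 305 residues of the matched protein without gaps, which the text does not state explicitly.

```python
def expand_positions(text: str) -> list[int]:
    """Expand a list such as '53, 66, 142-144 and 210' into integers."""
    positions = []
    for item in text.replace(" and ", ", ").split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            start, end = (int(part) for part in item.split("-"))
            positions.extend(range(start, end + 1))
        else:
            positions.append(int(item))
    return positions


DIFFERING = (
    "53, 66, 78, 89, 92, 94, 96, 100, 102, 106, 110, 113, 124, 128, "
    "136, 139, 140, 142-144, 166, 168, 173, 176, 178, 181, 182, 188, "
    "196, 199, 201, 202, 207 and 210"
)

positions = expand_positions(DIFFERING)
n_differences = len(positions)                       # substituted positions
matched_length = 305                                 # length of WWBP-1 per the text
percent_identity = 100 * (matched_length - n_differences) / matched_length
n_conservative = round(0.68 * n_differences)         # implied by the stated 68%
```

Under these assumptions the list contains 34 substituted positions, corresponding to roughly 89% identity, and a count of 23 conservative substitutions would be consistent with the stated 68%.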

Taken together, these data suggest that the protein of SEQ ID NO:128 may play a role in intracellular signaling. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, neurodegenerative diseases, cardiovascular disorders, hypertension, renal injury and repair and septic shock.

B) Proteins which are Remotely Related to Proteins with Known Functions

Protein of SEQ ID NO: 97 (Internal Designation 108-004-5-0-G6-FL)

The protein of SEQ ID NO: 97, found in liver and encoded by the extended cDNA SEQ ID NO: 52, shows homology to a lectin-like oxidized LDL receptor (LOX-1) found in human, bovine and murine species. Such type II proteins with a C-lectin-like domain, expressed in vascular endothelium and vascular-rich organs, bind and internalize oxidatively modified low-density lipoproteins (Sawamura et al., Nature, 386:73-77 (1997)). The oxidized lipoproteins have been implicated in the pathogenesis of atherosclerosis, a leading cause of death in industrialized countries (see review by Parthasarathy et al., Biochem. Pharmacol., 56:279-284 (1998)). In addition, type II membrane proteins with a C-terminal C-type lectin domain, also known as a carbohydrate-recognition domain, also include proteins involved in target-cell recognition and cell activation.

The protein of the invention has the typical structure of a type II protein belonging to the C-type lectin family. Indeed, it contains a short 31-amino-acid-long N-terminal tail, a transmembrane segment from positions 32 to 52 matching the one predicted for human LOX-1 and a large 177-amino-acid-long C-terminal tail as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10:685-686 (1994)). All six cysteines of the LOX-1 C-type lectin domain are also conserved in the protein of the invention (positions 102, 113, 130, 195, 208 and 216) although the characteristic PROSITE signature of this family is not. The LOX-1 protein is encoded by the nucleic acid sequence of Genbank accession number AB010710.

Taken together, these data suggest that the protein of SEQ ID NO: 97 may be involved in the metabolism of lipids and/or in cell-cell or cell-matrix interactions and/or in cell activation. Thus, this protein, or part thereof, may be useful in diagnosing and treating several disorders including, but not limited to, cancer, hyperlipidaemia, cardiovascular disorders and neurodegenerative disorders.

Protein of SEQ ID NO:111 (Internal Designation 108-008-5-0-G12-FL)

The protein of SEQ ID NO:111 encoded by the extended cDNA SEQ ID NO:66 shows homology to a mitochondrial protein found in Saccharomyces cerevisiae (PIR:S72254) which is similar to E. coli ribosomal protein L36. The typical PROSITE signature for ribosomal L36 is present in the protein of the invention (positions 76-102) except for a substitution of a tryptophan residue instead of a valine, leucine, isoleucine, methionine or asparagine residue.

Taken together, these data suggest that the protein of SEQ ID NO: 111 may be involved in protein biosynthesis. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer.

Protein of SEQ ID NO: 94 (Internal Designation 108-004-5-0-D10-FL)

The protein of SEQ ID NO: 94 encoded by the extended cDNA SEQ ID NO: 49 shows remote homology to a subfamily of beta4-galactosyltransferases widely conserved in animals (human, rodents, cow and chicken). Such enzymes, usually type II membrane proteins located in the endoplasmic reticulum or in the Golgi apparatus, catalyze the biosynthesis of glycoproteins, glycolipid glycans and lactose. Their characteristic features, defined as those of subfamily A in Breton et al., J. Biochem., 123:1000-1009 (1998), are well conserved in the protein of the invention, especially the region I containing the DVD motif (positions 163-165) thought to be involved either in UDP binding or in the catalytic process itself.

In addition, the protein of the invention has the typical structure of a type II protein. Indeed, it contains a short 28-amino-acid-long N-terminal tail, a transmembrane segment from positions 29 to 49 and a large 278-amino-acid-long C-terminal tail as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10:685-686 (1994)).
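The idea behind such transmembrane predictions can be sketched with a sliding-window hydropathy average. The code below is an illustration only and is not the TopPred II algorithm: it uses the Kyte-Doolittle hydropathy scale and a 21-residue window (chosen here because the segments reported above are 21 residues long), and all function names are hypothetical. A second helper derives the tail lengths from a segment and a total length; the 327-residue length used in the example is taken from Table V (−49 through 278).

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def best_hydrophobic_window(sequence: str, window: int = 21):
    """Return (start, end, mean hydropathy) of the most hydrophobic window.

    Positions are 1-based and inclusive.
    """
    if len(sequence) < window:
        raise ValueError("sequence shorter than window")
    scores = [KYTE_DOOLITTLE[aa] for aa in sequence]
    running = sum(scores[:window])
    best_start, best_mean = 0, float("-inf")
    for start in range(len(sequence) - window + 1):
        if start > 0:
            # Slide the window by one residue.
            running += scores[start + window - 1] - scores[start - 1]
        mean = running / window
        if mean > best_mean:
            best_start, best_mean = start, mean
    return best_start + 1, best_start + window, best_mean


def tail_lengths(tm_start: int, tm_end: int, protein_length: int):
    """Lengths of the N-terminal and C-terminal tails around one segment."""
    return tm_start - 1, protein_length - tm_end


# Toy sequence: a 21-leucine stretch flanked by charged residues.
start, end, mean = best_hydrophobic_window("D" * 10 + "L" * 21 + "K" * 10)
n_tail, c_tail = tail_lengths(29, 49, 327)
```

With a segment at positions 29 to 49 in a 327-residue protein, the helper returns tails of 28 and 278 residues, matching the values stated above.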

Taken together, these data suggest that the protein of SEQ ID NO: 94 may play a role in the biosynthesis of polysaccharides, and of the carbohydrate moieties of glycoproteins and glycolipids and/or in cell-cell recognition. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, atherosclerosis, cardiovascular disorders, autoimmune disorders and rheumatic diseases including rheumatoid arthritis.

Protein of SEQ ID NO:104 (Internal Designation 108-006-5-0-G2-FL)

The protein of SEQ ID NO:104 encoded by the extended cDNA SEQ ID NO: 59 shows homology to a neuronal murine protein NP15.6 whose expression is developmentally regulated. NP15.6 protein is encoded by the nucleic acid sequence of Genbank accession number Y08702.

Taken together, these data suggest that the protein of SEQ ID NO:104 may be involved in cellular proliferation and differentiation. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, neurodegenerative disorders and embryogenetic disorders.

C) Proteins Homologous to a Domain of a Protein with Known Function

Protein of SEQ ID NO:113 (Internal Designation 108-009-5-0-A2-FL)

The protein of SEQ ID NO:113 encoded by the extended cDNA SEQ ID NO: 68 shows extensive homology to the bZIP family of transcription factors, and especially to the human luman protein (Lu et al., Mol. Cell. Biol., 17:5117-5126 (1997)). The human luman protein is encoded by the nucleic acid sequence of Genbank accession number AF009368. The match includes the whole bZIP domain composed of a basic DNA-binding domain and of a leucine zipper allowing protein dimerization. The basic domain is conserved in the protein of the invention as shown by the characteristic PROSITE signature (positions 224-237) except for a conservative substitution of a glutamic acid with an aspartic acid in position 233. The typical PROSITE signature for leucine zipper is also present (positions 259 to 280). Secreted proteins may have nucleic acid binding domains, as shown by a nematode protein thought to regulate gene expression which exhibits zinc fingers as well as a functional signal peptide (Holst and Zipfel, J. Biol. Chem., 271:16275-16733, 1996).
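As an illustration of the leucine zipper signature mentioned above, the PROSITE leucine zipper pattern L-x(6)-L-x(6)-L-x(6)-L spans 22 residues, which agrees with the length of the interval from position 259 to position 280. The following sketch searches a sequence for that pattern; the function name and the toy sequence are hypothetical and are not taken from the sequence listing.

```python
import re

# Four leucines, each separated by six arbitrary residues (22 residues in all).
# The lookahead allows overlapping occurrences to be reported.
LEUCINE_ZIPPER = re.compile(r"(?=(L.{6}L.{6}L.{6}L))")


def find_leucine_zippers(sequence: str) -> list[tuple[int, int]]:
    """Return 1-based (start, end) positions of leucine zipper pattern matches."""
    return [(m.start() + 1, m.start() + len(m.group(1)))
            for m in LEUCINE_ZIPPER.finditer(sequence)]


# Toy sequence with leucines at positions 3, 10, 17 and 24.
toy = "MA" + "LAAAAAA" * 3 + "L" + "GG"
zippers = find_leucine_zippers(toy)
span = 280 - 259 + 1  # length of the interval reported in the text
```

A pattern match of this kind is only suggestive, since the motif is short and can occur by chance in proteins that do not dimerize through a leucine zipper.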

Taken together, these data suggest that the protein of SEQ ID NO:113 may bind to DNA, hence regulating gene expression as a transcription factor. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer.

Protein of SEQ ID NO:129 (Internal Designation 76-13-3-A9-CL1_1)

The protein of SEQ ID NO:129 encoded by the extended cDNA SEQ ID NO: 84 shows homology with part of a human seven transmembrane protein. The human seven transmembrane protein is encoded by the nucleic acid sequence of Genbank accession number Y11395. The matched protein, potentially associated with stomatin, may act as a G-protein coupled receptor and is likely to be important for signal transduction in neurons and haematopoietic cells (Mayer et al., Biochim. Biophys. Acta, 1395:301-308 (1998)).

Taken together, these data suggest that the protein of SEQ ID NO:129 may be involved in signal transduction. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, neurodegenerative diseases, cardiovascular disorders, hypertension, renal injury and repair and septic shock.

Protein of SEQ ID NO: 95 (Internal Designation 108-004-5-0-E8-FL)

The protein of SEQ ID NO: 95 encoded by the extended cDNA SEQ ID NO: 50 exhibits the typical PROSITE signature for amino acid permeases (positions 5 to 66), which are integral membrane proteins involved in the transport of amino acids into the cell. In addition, the protein of the invention has a transmembrane segment from positions 9 to 29 as predicted by the software TopPred II (Claros and von Heijne, CABIOS applic. Notes, 10:685-686 (1994)).

Taken together, these data suggest that the protein of SEQ ID NO: 95 may be involved in amino acid transport. Thus, this protein may be useful in diagnosing and/or treating several types of disorders including, but not limited to, cancer, aminoacidurias, neurodegenerative diseases, anorexia, chronic fatigue, coronary vascular disease, diphtheria, hypoglycemia, male infertility, muscular disorders and myopathies.

As discussed above, the extended cDNAs of the present invention or portions thereof can be used for various purposes. The polynucleotides can be used to express recombinant protein for analysis, characterization or therapeutic use; as markers for tissues in which the corresponding protein is preferentially expressed (either constitutively or at a particular stage of tissue differentiation or development or in disease states); as molecular weight markers on Southern gels; as chromosome markers or tags (when labeled) to identify chromosomes or to map related gene positions; to compare with endogenous DNA sequences in patients to identify potential genetic disorders; as probes to hybridize and thus discover novel, related DNA sequences; as a source of information to derive PCR primers for genetic fingerprinting; for selecting and making oligomers for attachment to a “gene chip” or other support, including for examination for expression patterns; to raise anti-protein antibodies using DNA immunization techniques; and as an antigen to raise anti-DNA antibodies or elicit another immune response. Where the polynucleotide encodes a protein which binds or potentially binds to another protein (such as, for example, in a receptor-ligand interaction), the polynucleotide can also be used in interaction trap assays (such as, for example, that described in Gyuris et al., Cell 75:791-803 (1993)) to identify polynucleotides encoding the other protein with which binding occurs or to identify inhibitors of the binding interaction.

The proteins or polypeptides provided by the present invention can similarly be used in assays to determine biological activity, including in a panel of multiple proteins for high-throughput screening; to raise antibodies or to elicit another immune response; as a reagent (including the labeled reagent) in assays designed to quantitatively determine levels of the protein (or its receptor) in biological fluids; as markers for tissues in which the corresponding protein is preferentially expressed (either constitutively or at a particular stage of tissue differentiation or development or in a disease state); and, of course, to isolate correlative receptors or ligands. Where the protein binds or potentially binds to another protein (such as, for example, in a receptor-ligand interaction), the protein can be used to identify the other protein with which binding occurs or to identify inhibitors of the binding interaction. Proteins involved in these binding interactions can also be used to screen for peptide or small molecule inhibitors or agonists of the binding interaction.

Any or all of these research utilities are capable of being developed into reagent grade or kit format for commercialization as research products.

Methods for performing the uses listed above are well known to those skilled in the art. References disclosing such methods include without limitation “Molecular Cloning: A Laboratory Manual”, 2d ed., Cold Spring Harbor Laboratory Press, Sambrook, J., E. F. Fritsch and T. Maniatis eds., 1989, and “Methods in Enzymology: Guide to Molecular Cloning Techniques”, Academic Press, Berger, S. L. and A. R. Kimmel eds., 1987.

Polynucleotides and proteins of the present invention can also be used as nutritional sources or supplements. Such uses include without limitation use as a protein or amino acid supplement, use as a carbon source, use as a nitrogen source and use as a source of carbohydrate. In such cases the protein or polynucleotide of the invention can be added to the feed of a particular organism or can be administered as a separate solid or liquid preparation, such as in the form of powder, pills, solutions, suspensions or capsules. In the case of microorganisms, the protein or polynucleotide of the invention can be added to the medium in or on which the microorganism is cultured.

Although this invention has been described in terms of certain preferred embodiments, other embodiments which will be apparent to those of ordinary skill in the art in view of the disclosure herein are also within the scope of this invention. Accordingly, the scope of the invention is intended to be defined only by reference to the appended claims. All documents cited herein are incorporated herein by reference in their entirety.

TABLE I

SEQ ID NO. in         Provisional Application                                     SEQ ID NO. in
Present Application   Disclosing Sequence                                         Provisional Application
40                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     40
41                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     41
42                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      62
43                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      47
44                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      43
45                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     42
46                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     43
47                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      45
48                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      44
49                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      50
50                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      49
51                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     44
52                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     45
53                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     46
54                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      51
55                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      59
56                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      61
57                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      53
58                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      52
59                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      54
60                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     47
61                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      63
62                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      46
63                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     48
64                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      58
65                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      56
66                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     49
67                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      57
68                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      55
69                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      42
70                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      41
71                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      48
72                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      60
73                    U.S. Application No. 60/096,116, filed on Aug. 10, 1998     50
74                    U.S. Application No. 60/099,273, filed on Sep. 4, 1998      40
75                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      42
76                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      56
77                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      57
78                    U.S. Application No. 60/081,563, filed on Apr. 13, 1998     84
79                    U.S. Application No. 60/081,563, filed on Apr. 13, 1998     69
80                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      62
81                    U.S. Application No. 60/081,563, filed on Apr. 13, 1998     79
82                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      64
83                    U.S. Application No. 60/081,563, filed on Apr. 13, 1998     51
84                    U.S. Application No. 60/074,121, filed on Feb. 9, 1998      71
130                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     40
131                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     41
132                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     42
133                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     43
134                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     44
135                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     45
136                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     46
137                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     47
138                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     48
139                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     49
140                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     50
141                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     53
142                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     54
143                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     55
144                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     56
145                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     57
146                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     58
147                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     59
148                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     60
149                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     61
150                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     62
151                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     63
152                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     64
153                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     65
154                   U.S. Application No. 60/081,563, filed on Apr. 13, 1998     66

TABLE II
Parameters used for each step of EST analysis

               Search Characteristics                 Selection Characteristics
Step           Program   Strand   Parameters          Identity (%)   Length (bp)
Miscellaneous  Blastn    both     S = 61 X = 16       90             17
tRNA           Fasta     both     —                   80             60
rRNA           Blastn    both     S = 108             80             40
mtRNA          Blastn    both     S = 108             80             40
Procaryotic    Blastn    both     S = 144             90             40
Fungal         Blastn    both     S = 144             90             40
Alu            fasta*    both     —                   70             40
L1             Blastn    both     S = 72              70             40
Repeats        Blastn    both     S = 72              70             40
Promoters      Blastn    top      S = 54 X = 16       90             15†
Vertebrate     fasta*    both     S = 108             90             30
ESTs           Blastn    both     S = 108 X = 16      90             30
Proteins       blastx    top      E = 0.001           —              —

*use “Quick Fast” Database Scanner
†alignment further constrained to begin closer than 10 bp to EST 5′ end
using BLOSUM62 substitution matrix

TABLE III
Parameters used for each step of extended cDNA analysis

                        Search characteristics                                     Selection characteristics
Step                    Program            Strand  Parameters                   Identity (%)  Length (bp)  Comments
miscellaneous*          FASTA              both    —                            90            15
tRNA$                   FASTA              both    —                            80            90
rRNA$                   BLASTN             both    S = 108                      80            40
mtRNA$                  BLASTN             both    S = 108                      80            40
Procaryotic$            BLASTN             both    S = 144                      90            40
Fungal*                 BLASTN             both    S = 144                      90            40
Alu*                    BLASTN             both    S = 72                       70            40           max 5 matches, masking
L1$                     BLASTN             both    S = 72                       70            40           max 5 matches, masking
Repeats$                BLASTN             both    S = 72                       70            40           masking
PolyA                   BLAST2N            top     W = 6, S = 10, E = 1000      90            8            in the last 20 nucleotides
Polyadenylation signal  —                  top     AATAAA allowing 1 mismatch                              in the 50 nucleotides preceding the 5′ end of the polyA
Vertebrate*             BLASTN then FASTA  both    —                            90 then 70    30           first BLASTN and then FASTA on matching sequences
ESTs*                   BLAST2N            both    —                            90            30
Geneseq                 BLASTN             both    W = 8, B = 10                90            30
ORF                     BLASTP             top     W = 8, B = 10                —             —            on ORF proteins, max 10 matches
Proteins*               BLASTX             top     E = 0.001                    70            30

$steps common to EST analysis and using the same algorithms and parameters
*steps also used in EST analysis but with different algorithms and/or parameters
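The polyadenylation signal criterion of Table III (AATAAA allowing 1 mismatch in the 50 nucleotides preceding the 5′ end of the polyA) can be sketched as follows. This is an illustrative reading of the table entry rather than the procedure actually used; the function name, its parameters and the toy sequence are hypothetical, and it is assumed that the whole hexamer must lie within the 50-nucleotide window.

```python
def find_polya_signals(sequence: str, polya_start: int,
                       signal: str = "AATAAA",
                       max_mismatches: int = 1,
                       window: int = 50) -> list[tuple[int, int, int]]:
    """Find candidate polyadenylation signals upstream of a polyA tail.

    polya_start: 1-based position of the first nucleotide of the polyA tail.
    Returns (start, end, mismatches) tuples with 1-based inclusive positions.
    """
    region_end = polya_start - 1                 # 0-based exclusive end
    region_start = max(0, region_end - window)   # 0-based start of the window
    hits = []
    for i in range(region_start, region_end - len(signal) + 1):
        candidate = sequence[i:i + len(signal)].upper()
        mismatches = sum(a != b for a, b in zip(candidate, signal))
        if mismatches <= max_mismatches:
            hits.append((i + 1, i + len(signal), mismatches))
    return hits


# Toy sequence: an exact signal 12 nucleotides upstream of a 20-nucleotide polyA.
toy = "G" * 30 + "AATAAA" + "C" * 12 + "A" * 20
exact_hits = find_polya_signals(toy, polya_start=49)

# Same layout with a single mismatch in the hexamer (AATACA).
variant = "G" * 30 + "AATACA" + "C" * 12 + "A" * 20
variant_hits = find_polya_signals(variant, polya_start=49)
```

Each hit reports the number of mismatches, so exact signals can be preferred over one-mismatch variants when several candidates fall in the window.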

TABLE IV Mature Stop FCS SigPep Polypeptide Codon PolyA Signal PolyASite Id Location Location Location Location Location Location 40 35through 35 through 100 101 through 569 667 through 685 through 568 568672 699 41 68 through 68 through 124 125 through 338 462 through 482through 337 337 467 497 42 39 through 39 through 83 84 through 414 566through 583 through 413 413 571 598 43 235 through 235 through 337through 643 1540 through 1564 through 642 336 642 1545 1579 44 42through 42 through 200 201 through 756 860 through 878 through 755 755865 893 45 23 through 23 through 235 236 through 341 611 through 629through 340 340 616 644 46 12 through 12 through 263 264 through 381 —523 through 380 380 538 47 8 through 232 8 through 154 155 through 233 —737 through 232 752 48 183 through 183 through 303 through 423 505through 523 through 422 302 422 510 537 49 24 through 24 through 170 171through 1005 — 1586 through 1004 1004 1602 50 80 through 80 through 139140 through 785 910 through 933 through 784 784 915 948 51 67 through 67through 159 160 through 223 — 673 through 222 222 687 52 46 through 46through 186 187 through 733 781 through 806 through 732 732 786 821 5381 through 81 through 152 153 through 357 406 through 429 through 356356 411 445 54 72 through 72 through 140 141 through 1347 1482 through1502 through 1346 1346 1487 1517 55 194 through 194 through 380 through455 — 1545 through 454 379 454 1560 56 48 through 48 through 347 348through 495 1031 through 1051 through 494 494 1036 1066 57 111 through111 through 216 through 672 990 through 1045 through 671 215 671 9951061 58 5 through 373 5 through 82 83 through 374 1986 through 2010through 373 1991 2025 59 14 through 14 through 319 320 through 473 555through 576 through 472 472 560 591 60 2 through 217 — 2 through 217 218489 through 529 through 494 544 61 51 through 51 through 110 111 through576 1653 through 1674 through 575 575 1658 1689 62 69 through 69 through128 129 through 978 1076 through 1096 through 977 977 
1081 1111 63 44through 44 through 160 161 through 239 443 through 540 through 238 238448 554 64 114 through 114 through 165 through 525 1739 through 1758through 524 164 524 1744 1773 65 26 through 26 through 64 65 through 488883 through 901 through 487 487 888 917 66 80 through 80 through 187 188through 389 609 through 627 through 388 388 614 641 67 186 through 186through 408 through 444 827 through 839 through 443 407 443 832 854 6875 through 75 through 1005 through 1260 1536 through 1553 through 12591004 1259 1541 1568 69 98 through 98 through 151 152 through 377 471through 491 through 376 376 476 506 70 72 through 72 through 134 135through 255 506 through 528 through 254 254 511 542 71 148 through 148through 241 through 1141 1590 through 1614 through 1140 240 1140 15951629 72 109 through 109 through 406 through 739 1633 through 1650through 738 405 738 1638 1665 73 55 through 55 through 255 256 through292 390 through 410 through 291 291 395 425 74 25 through — 25 through277 508 through 533 through 276 276 513 546 75 32 through 32 through 9192 through 308 452 through 472 through 307 307 457 485 76 46 through 46through 87 88 through 676 1363 through 1382 through 675 675 1368 1394 77329 through 329 through 746 through 944 — 1322 through 943 745 943 133378 27 through 27 through 77 78 through 282 — — 281 281 79 61 through 61through 213 214 through 406 675 through 692 through 405 405 680 703 80137 through 137 through 230 through 380 728 through 755 through 379 229379 733 768 81 37 through 37 through 153 154 through 742 969 through 994through 741 741 974 1007 82 80 through 80 through 142 143 through 266491 through 517 through 265 265 496 527 83 612 through — 612 through 645829 through 850 through 644 644 834 861 84 61 through 61 through 162 163through 229 208 through — 228 228 213 130 15 through 15 through 110 111through 312 507 through 531 through 311 311 512 542 131 50 through 50through 130 131 through 530 877 through 899 through 529 529 882 909 132240 through 240 
through 306 through 417 1117 through 1139 through 416305 416 1122 1149 133 111 through 111 through 255 through 447 890through 909 through 446 254 446 895 921 134 123 through 123 through 291through 456 886 through 904 through 455 290 455 891 916 135 2 through433 2 through 232 233 through 434 488 through 510 through 433 493 520136 34 through 34 through 87 88 through 364 536 through 558 through 363363 541 568 137 50 through 50 through 157 158 through 287 385 through405 through 286 286 390 416 138 50 through 50 through 151 152 through638 — 1277 through 637 637 1289 139 72 through 72 through 125 126through 603 — 704 through 602 602 715 140 120 through 120 through 186through 435 899 through 918 through 434 185 434 904 931 141 4 through447 4 through 147 148 through 448 858 through 880 through 447 863 891142 28 through 28 through 96 97 through 805 — 806 through 804 804 817143 27 through 27 through 212 213 through 360 988 through 1009 through359 359 993 1020 144 25 through 25 through 93 94 through 958 1368through 1388 through 957 957 1373 1399 145 47 through 47 through 226 227through 320 — 656 through 319 319 666 146 80 through 80 through 130 131through 941 1101 through 1119 through 940 940 1106 1130 147 146 through146 through 293 through 458 442 through 465 through 457 292 457 447 475148 100 through 100 through 208 through 352 — 940 through 351 207 351949 149 177 through 177 through 237 through 570 — 931 through 569 236569 939 150 67 through 67 through 135 136 through 460 856 through 875through 459 459 861 887 151 65 through 65 through 112 113 through 10701978 through 1999 through 1069 1069 1983 2010 152 70 through 70 through234 235 through 322 364 through 375 through 321 321 369 387 153 38through 38 through 91 92 through 878 947 through 974 through 877 877 952983 154 51 through 51 through 203 204 through 471 1585 through 1604through 470 470 1590 1614

TABLE V

Id    Full Length Polypeptide Location   Signal Peptide Location   Mature Polypeptide Location
85    −22 through 156                    −22 through −1            1 through 156
86    −19 through 71                     −19 through −1            1 through 71
87    −15 through 110                    −15 through −1            1 through 110
88    −34 through 102                    −34 through −1            1 through 102
89    −53 through 185                    −53 through −1            1 through 185
90    −71 through 35                     −71 through −1            1 through 35
91    −84 through 39                     −84 through −1            1 through 39
92    −49 through 26                     −49 through −1            1 through 26
93    −40 through 40                     −40 through −1            1 through 40
94    −49 through 278                    −49 through −1            1 through 278
95    −20 through 215                    −20 through −1            1 through 215
96    −31 through 21                     −31 through −1            1 through 21
97    −47 through 182                    −47 through −1            1 through 182
98    −24 through 68                     −24 through −1            1 through 68
99    −23 through 402                    −23 through −1            1 through 402
100   −62 through 25                     −62 through −1            1 through 25
101   −100 through 49                    −100 through −1           1 through 49
102   −35 through 152                    −35 through −1            1 through 152
103   −26 through 97                     −26 through −1            1 through 97
104   −102 through 51                    −102 through −1           1 through 51
105   1 through 72                       —                         1 through 72
106   −20 through 155                    −20 through −1            1 through 155
107   −20 through 283                    −20 through −1            1 through 283
108   −39 through 26                     −39 through −1            1 through 26
109   −17 through 120                    −17 through −1            1 through 120
110   −13 through 141                    −13 through −1            1 through 141
111   −36 through 67                     −36 through −1            1 through 67
112   −74 through 12                     −74 through −1            1 through 12
113   −310 through 85                    −310 through −1           1 through 85
114   −18 through 75                     −18 through −1            1 through 75
115   −21 through 40                     −21 through −1            1 through 40
116   −31 through 300                    −31 through −1            1 through 300
117   −99 through 111                    −99 through −1            1 through 111
118   −67 through 12                     −67 through −1            1 through 12
119   1 through 84                       —                         1 through 84
120   −20 through 72                     −20 through −1            1 through 72
121   −14 through 196                    −14 through −1            1 through 196
122   −139 through 66                    −139 through −1           1 through 66
123   −17 through 68                     −17 through −1            1 through 68
124   −51 through 64                     −51 through −1            1 through 64
125   −31 through 50                     −31 through −1            1 through 50
126   −39 through 196                    −39 through −1            1 through 196
127   −21 through 41                     −21 through −1            1 through 41
128   1 through 11                       —                         1 through 11
129   −34 through 22                     −34 through −1            1 through 22
155   −32 through 67                     −32 through −1            1 through 67
156   −27 through 133                    −27 through −1            1 through 133
157   −22 through 37                     −22 through −1            1 through 37
158   −48 through 64                     −48 through −1            1 through 64
159   −56 through 55                     −56 through −1            1 through 55
160   −77 through 67                     −77 through −1            1 through 67
161   −18 through 92                     −18 through −1            1 through 92
162   −36 through 43                     −36 through −1            1 through 43
163   −34 through 162                    −34 through −1            1 through 162
164   −18 through 159                    −18 through −1            1 through 159
165   −22 through 83                     −22 through −1            1 through 83
166   −48 through 100                    −48 through −1            1 through 100
167   −23 through 236                    −23 through −1            1 through 236
168   −62 through 49                     −62 through −1            1 through 49
169   −23 through 288                    −23 through −1            1 through 288
170   −60 through 31                     −60 through −1            1 through 31
171   −17 through 270                    −17 through −1            1 through 270
172   −49 through 55                     −49 through −1            1 through 55
173   −36 through 48                     −36 through −1            1 through 48
174   −20 through 111                    −20 through −1            1 through 111
175   −23 through 108                    −23 through −1            1 through 108
176   −16 through 319                    −16 through −1            1 through 319
177   −55 through 29                     −55 through −1            1 through 29
178   −18 through 262                    −18 through −1            1 through 262
179   −51 through 89                     −51 through −1            1 through 89
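The nucleotide coordinates of Table IV and the amino acid ranges of Table V can be cross-checked by counting codons. The sketch below does this for SEQ ID NO:40, whose encoded polypeptide is SEQ ID NO:85 according to the shared internal designation in Table VII. The function names are hypothetical, and the coordinates are read from the Table IV entry as signal peptide 35 through 100 and mature polypeptide 101 through 568.

```python
def codon_count(start: int, end: int) -> int:
    """Number of codons in a 1-based inclusive nucleotide range."""
    length = end - start + 1
    if length % 3 != 0:
        raise ValueError("range is not a whole number of codons")
    return length // 3


def listing_range(sigpep: tuple[int, int], mature: tuple[int, int]) -> tuple[int, int]:
    """Derive the Table V style range (first, last) from Table IV coordinates."""
    signal_codons = codon_count(*sigpep)
    mature_codons = codon_count(*mature)
    # Signal peptide residues are numbered negatively, mature residues from 1.
    return -signal_codons, mature_codons


# SEQ ID NO:40 coordinates as read from Table IV.
first, last = listing_range(sigpep=(35, 100), mature=(101, 568))
```

For these coordinates the signal peptide spans 22 codons and the mature polypeptide 156 codons, which agrees with the range −22 through 156 given for SEQ ID NO:85.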

TABLE VI

Id   Collection refs   Deposit Name
40   ATCC# 98921       SignalTag 121-144
41   ATCC# 98921       SignalTag 121-144
42   ATCC# 98919       SignalTag 145-165
43   ATCC# 98919       SignalTag 145-165
44   ATCC# 98919       SignalTag 145-165
45   ATCC# 98921       SignalTag 121-144
46   ATCC# 98921       SignalTag 121-144
47   ATCC# 98919       SignalTag 145-165
48   ATCC# 98919       SignalTag 145-165
49   ATCC# 98919       SignalTag 145-165
50   ATCC# 98919       SignalTag 145-165
51   ATCC# 98921       SignalTag 121-144
52   ATCC# 98921       SignalTag 121-144
53   ATCC# 98921       SignalTag 121-144
54   ATCC# 98919       SignalTag 145-165
55   ATCC# 98919       SignalTag 145-165
56   ATCC# 98919       SignalTag 145-165
57   ATCC# 98919       SignalTag 145-165
58   ATCC# 98919       SignalTag 145-165
59   ATCC# 98919       SignalTag 145-165
60   ATCC# 98921       SignalTag 121-144
61   ATCC# 98919       SignalTag 145-165
62   ATCC# 98919       SignalTag 145-165
63   ATCC# 98921       SignalTag 121-144
64   ATCC# 98919       SignalTag 145-165
65   ATCC# 98919       SignalTag 145-165
66   ATCC# 98921       SignalTag 121-144
67   ATCC# 98919       SignalTag 145-165
68   ATCC# 98919       SignalTag 145-165
69   ATCC# 98919       SignalTag 145-165
70   ATCC# 98919       SignalTag 145-165
71   ECACC# XXXX       SignalTag 28011 999
72   ECACC# XXXX       SignalTag 28011 999
73   ECACC# XXXX       SignalTag 28011 999
74   ECACC# XXXX       SignalTag 28011 999
75   ECACC# XXXX       SignalTag 28011 999
76   ECACC# XXXX       SignalTag 28011 999
77   ECACC# XXXX       SignalTag 28011 999
78   ECACC# XXXX       SignalTag 28011 999
79   ECACC# XXXX       SignalTag 28011 999
80   ECACC# XXXX       SignalTag 28011 999
81   ECACC# XXXX       SignalTag 28011 999
82   ECACC# XXXX       SignalTag 28011 999
83   ECACC# XXXX       SignalTag 28011 999
84   ECACC# XXXX       SignalTag 28011 999

TABLE VII

Internal designation  Id   Type of sequence
108-002-5-0-B1-FL     40   DNA
108-002-5-0-F3-FL     41   DNA
108-002-5-0-F4-FL     42   DNA
108-003-5-0-A8-FL     43   DNA
108-003-5-0-D2-FL     44   DNA
108-003-5-0-E5-FL     45   DNA
108-003-5-0-H2-FL     46   DNA
108-004-5-0-B7-FL     47   DNA
108-004-5-0-C8-FL     48   DNA
108-004-5-0-D10-FL    49   DNA
108-004-5-0-E8-FL     50   DNA
108-004-5-0-F5-FL     51   DNA
108-004-5-0-G6-FL     52   DNA
108-005-5-0-B11-FL    53   DNA
108-005-5-0-C1-FL     54   DNA
108-005-5-0-F11-FL    55   DNA
108-005-5-0-F6-FL     56   DNA
108-006-5-0-C2-FL     57   DNA
108-006-5-0-E6-FL     58   DNA
108-006-5-0-G2-FL     59   DNA
108-006-5-0-G4-FL     60   DNA
108-008-5-0-A6-FL     61   DNA
108-008-5-0-A8-FL     62   DNA
108-008-5-0-C10-FL    63   DNA
108-008-5-0-E6-FL     64   DNA
108-008-5-0-F6-FL     65   DNA
108-008-5-0-G12-FL    66   DNA
108-008-5-0-G4-FL     67   DNA
108-009-5-0-A2-FL     68   DNA
108-013-5-0-C12-FL    69   DNA
108-013-5-0-G11-FL    70   DNA
108-003-5-0-E4-FL     71   DNA
108-005-5-0-D6-FL     72   DNA
108-008-5-0-G3-FL     73   DNA
108-013-5-0-B5-FL     74   DNA
26-44-1-B5-CL3_1      75   DNA
47-4-4-C6-CL2_3       76   DNA
47-40-4-G9-CL1_1      77   DNA
48-25-4-D8-CL1_7      78   DNA
48-28-3-A9-CL0_1      79   DNA
51-25-1-A2-CL3_1      80   DNA
55-10-3-F5-CL0_3      81   DNA
57-19-2-G8-CL1_3      82   DNA
58-34-2-H8-CL1_3      83   DNA
76-13-3-A9-CL1_1      84   DNA
78-7-2-B8-FL1         130  DNA
77-8-4-F9-FL1         131  DNA
58-8-1-F2-FL2         132  DNA
77-13-1-A7-FL2        133  DNA
47-2-3-G9-FL1         134  DNA
33-75-4-H7-FL1        135  DNA
51-41-1-F10-FL1       136  DNA
48-51-4-C11-FL1       137  DNA
33-58-3-C8-FL1        138  DNA
76-20-4-C11-FL1       139  DNA
76-28-3-A12-FL1       140  DNA
76-25-4-F11-FL1       141  DNA
58-20-4-G7-FL1        142  DNA
33-54-1-B9-FL1        143  DNA
76-20-3-H1-FL1        144  DNA
47-20-2-G3-FL1        145  DNA
78-25-1-H11-FL1       146  DNA
78-6-2-B10-FL1        147  DNA
58-49-3-G10-FL1       148  DNA
78-21-1-B7-FL1        149  DNA
57-28-4-B12-FL1       150  DNA
33-77-4-E2-FL1        151  DNA
58-19-3-D3-FL2        152  DNA
37-7-4-E7-FL1         153  DNA
60-14-2-H10-FL1       154  DNA
108-002-5-0-B1-FL     85   PRT
108-002-5-0-F3-FL     86   PRT
108-002-5-0-F4-FL     87   PRT
108-003-5-0-A8-FL     88   PRT
108-003-5-0-D2-FL     89   PRT
108-003-5-0-E5-FL     90   PRT
108-003-5-0-H2-FL     91   PRT
108-004-5-0-B7-FL     92   PRT
108-004-5-0-C8-FL     93   PRT
108-004-5-0-D10-FL    94   PRT
108-004-5-0-E8-FL     95   PRT
108-004-5-0-F5-FL     96   PRT
108-004-5-0-G6-FL     97   PRT
108-005-5-0-B11-FL    98   PRT
108-005-5-0-C1-FL     99   PRT
108-005-5-0-F11-FL    100  PRT
108-005-5-0-F6-FL     101  PRT
108-006-5-0-C2-FL     102  PRT
108-006-5-0-E6-FL     103  PRT
108-006-5-0-G2-FL     104  PRT
108-006-5-0-G4-FL     105  PRT
108-008-5-0-A6-FL     106  PRT
108-008-5-0-A8-FL     107  PRT
108-008-5-0-C10-FL    108  PRT
108-008-5-0-E6-FL     109  PRT
108-008-5-0-F6-FL     110  PRT
108-008-5-0-G12-FL    111  PRT
108-008-5-0-G4-FL     112  PRT
108-009-5-0-A2-FL     113  PRT
108-013-5-0-C12-FL    114  PRT
108-013-5-0-G11-FL    115  PRT
108-003-5-0-E4-FL     116  PRT
108-005-5-0-D6-FL     117  PRT
108-008-5-0-G3-FL     118  PRT
108-013-5-0-B5-FL     119  PRT
26-44-1-B5-CL3_1      120  PRT
47-4-4-C6-CL2_3       121  PRT
47-40-4-G9-CL1_1      122  PRT
48-25-4-D8-CL1_7      123  PRT
48-28-3-A9-CL0_1      124  PRT
51-25-1-A2-CL3_1      125  PRT
55-10-3-F5-CL0_3      126  PRT
57-19-2-G8-CL1_3      127  PRT
58-34-2-H8-CL1_3      128  PRT
76-13-3-A9-CL1_1      129  PRT
78-7-2-B8-FL1         155  PRT
77-8-4-F9-FL1         156  PRT
58-8-1-F2-FL2         157  PRT
77-13-1-A7-FL2        158  PRT
47-2-3-G9-FL1         159  PRT
33-75-4-H7-FL1        160  PRT
51-41-1-F10-FL1       161  PRT
48-51-4-C11-FL1       162  PRT
33-58-3-C8-FL1        163  PRT
76-20-4-C11-FL1       164  PRT
76-28-3-A12-FL1       165  PRT
76-25-4-F11-FL1       166  PRT
58-20-4-G7-FL1        167  PRT
33-54-1-B9-FL1        168  PRT
76-20-3-H1-FL1        169  PRT
47-20-2-G3-FL1        170  PRT
78-25-1-H11-FL1       171  PRT
78-6-2-B10-FL1        172  PRT
58-49-3-G10-FL1       173  PRT
78-21-1-B7-FL1        174  PRT
57-28-4-B12-FL1       175  PRT
33-77-4-E2-FL1        176  PRT
58-19-3-D3-FL2        177  PRT
37-7-4-E7-FL1         178  PRT
60-14-2-H10-FL1       179  PRT
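
Each internal designation in Table VII appears twice, once with a DNA sequence Id and once with a PRT sequence Id. As an illustration only, the Python sketch below pairs the two Ids by designation using a few rows copied from the table. The names `ROWS` and `pair_sequences` are hypothetical and are not part of the disclosure.

```python
from collections import defaultdict

# A few rows copied from Table VII: (internal designation, Id, type of sequence)
ROWS = [
    ("108-006-5-0-G2-FL", 59, "DNA"),
    ("108-006-5-0-G4-FL", 60, "DNA"),
    ("78-7-2-B8-FL1", 130, "DNA"),
    ("108-006-5-0-G2-FL", 104, "PRT"),
    ("108-006-5-0-G4-FL", 105, "PRT"),
    ("78-7-2-B8-FL1", 155, "PRT"),
]


def pair_sequences(rows):
    """Group sequence Ids by internal designation, keyed by sequence type."""
    pairs = defaultdict(dict)
    for designation, seq_id, seq_type in rows:
        pairs[designation][seq_type] = seq_id
    return dict(pairs)


PAIRS = pair_sequences(ROWS)
for designation, ids in PAIRS.items():
    print(designation, "DNA:", ids.get("DNA"), "PRT:", ids.get("PRT"))
```

With these rows, the designation 108-006-5-0-G2-FL maps to DNA Id 59 and PRT Id 104.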

TABLE VIII

Id   Locations  PROSITE signature Name
89   205-226    Leucine zipper
95   5-66       Amino acid permease
103  46-67      Leucine zipper
113  259-280    Leucine zipper
120  27-40      MAT8 family
122  123-125    Cell attachment sequence

1. An isolated or purified polypeptide:
   a) comprising at least 50, 75, 100 or 150 consecutive amino acids of SEQ ID NO: 104;
   b) comprising amino acids 1 to 51 of SEQ ID NO: 104;
   c) consisting of amino acids 1 to 51 of SEQ ID NO: 104;
   d) comprising SEQ ID NO: 104;
   e) consisting of SEQ ID NO: 104;
   f) comprising the amino acid sequence encoded by the insert from deposited clone 108-006-5-0-G2-FL in ATCC accession number 98921; or
   g) comprising an amino acid sequence that has at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identity to the polypeptide of SEQ ID NO: 104.
2. The polypeptide according to claim 1, wherein said polypeptide comprises amino acids 1 to 51 of SEQ ID NO: 104.
3. The polypeptide according to claim 1, wherein said polypeptide consists of amino acids 1 to 51 of SEQ ID NO: 104.
4. The polypeptide according to claim 1, wherein said polypeptide comprises the sequence of SEQ ID NO: 104.
5. The polypeptide according to claim 4, wherein said polypeptide consists of an amino acid sequence of SEQ ID NO: 104.
6. The polypeptide according to claim 1, wherein said polypeptide comprises an amino acid sequence encoded by the insert from deposited clone 108-006-5-0-G2-FL in ATCC accession number 98921.
7. The polypeptide of claim 1, wherein said polypeptide has at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identity to the polypeptide of SEQ ID NO: 104.
8. The polypeptide of claim 1, wherein said polypeptide is a neuronal protein involved in cellular proliferation and differentiation.
9. A method of making a polypeptide comprising the steps of:
   a) obtaining a cDNA encoding a polypeptide:
      i) comprising at least 50, 75, 100 or 150 consecutive amino acids of SEQ ID NO: 104;
      ii) comprising amino acids 1 to 51 of SEQ ID NO: 104;
      iii) consisting of amino acids 1 to 51 of SEQ ID NO: 104;
      iv) comprising SEQ ID NO: 104;
      v) consisting of SEQ ID NO: 104;
      vi) comprising the amino acid sequence encoded by the insert from deposited clone 108-006-5-0-G2-FL in ATCC accession number 98921; or
      vii) comprising an amino acid sequence that has at least 80%, 85%, 90%, 95%, 96%, 97%, 98%, or 99% identity to the polypeptide of SEQ ID NO: 104;
   b) inserting said cDNA in an expression vector such that said cDNA is operably linked to a promoter; and
   c) introducing said expression vector into a host cell whereby said host cell produces the protein encoded by said cDNA.
10. The method of claim 9, further comprising the step of isolating said polypeptide.